
1. <h1>Introduction to key Pandas data structures: Series and DataFrame</h1>

1. Series:
   - A Series is a one-dimensional labeled array capable of holding any data type (e.g., integers, strings, or floats).
   - It can be thought of as a column in a spreadsheet or a single column of data in a table.
   - Series objects consist of two main components: the index and the data.
   - The index provides labels for each element in the Series, allowing for efficient data retrieval and alignment.
   - The data component stores the actual values of the Series.
2. DataFrame:
   - A DataFrame is a two-dimensional labeled data structure with columns of potentially different data types.
   - It can be thought of as a table or a spreadsheet with rows and columns.
   - DataFrames are the primary data structure used in Pandas and are highly versatile for data analysis.
   - Each column in a DataFrame is a Series object, sharing the same index.
   - DataFrames have both row and column indexes, enabling flexible data access and manipulation.
Key Differences between Series and DataFrame:
1. Dimensionality:
   - Series is one-dimensional, representing a single column of data.
   - DataFrame is two-dimensional, representing a table with rows and columns.
2. Structure:
   - Series has a single index, providing labels for each element in the series.
   - DataFrame has both row and column indexes, allowing for efficient data alignment and manipulation.
3. Data Types:
   - Series can hold data of any type, such as integers, strings, or floats.
   - DataFrame allows for different data types in different columns.
4. Relationship:
   - A DataFrame is a collection of Series objects that share the same index.
   - Each column in a DataFrame can be accessed as a separate Series.
Use Cases:
1. Series:
   - Working with a single column of data, such as stock prices, temperature readings, or sales figures.
   - Performing operations or calculations on a specific column of data.
2. DataFrame:
   - Handling tabular data with multiple columns, such as a dataset with observations across different variables.
   - Performing data manipulation, analysis, and cleaning tasks on structured data.
   - Merging, joining, and aggregating data from different sources.
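
The relationship between the two structures described above can be sketched in a few lines. This is a minimal illustration with made-up values; the names `prices` and `table` are chosen for this example only.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
prices = pd.Series([101.5, 102.0, 99.8], index=['Mon', 'Tue', 'Wed'], name='Price')

# A DataFrame: several columns that share one row index
table = pd.DataFrame(
    {'Price': [101.5, 102.0, 99.8], 'Volume': [1200, 950, 1430]},
    index=['Mon', 'Tue', 'Wed'],
)

print(prices.ndim)   # 1
print(table.ndim)    # 2

# Selecting a single column returns a Series that carries the DataFrame's row index
column = table['Price']
print(type(column).__name__)
print(column.index.equals(table.index))
```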



2. <h1>Working with Series</h1>

<h2>Creating a Series</h2>
You can create a Series in Pandas using the <b>pd.Series()</b> function.

In [None]:
import pandas as pd

# Creating a Series from a list
data = [10, 20, 30, 40, 50]
series = pd.Series(data)
print(series)


0    10
1    20
2    30
3    40
4    50
dtype: int64


<h2>Indexing and slicing Series</h2>
Series objects have an index associated with each element, allowing you to access and slice data. You can use various indexing techniques, including integer-based indexing, label-based indexing, and boolean indexing.

In [None]:
# Integer-based indexing
print(series[0])     # Accessing the first element
print(series[2:4])   # Slicing from index 2 to 4 (exclusive)

# Label-based indexing
index = ['A', 'B', 'C', 'D', 'E']
series = pd.Series(data, index=index)
print(series['B'])   # Accessing element by label

# Boolean indexing
print(series[series > 30])   # Selecting elements based on a condition


10
2    30
3    40
dtype: int64
20
D    40
E    50
dtype: int64
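
When a Series has string labels, the `.iloc` and `.loc` accessors make it explicit whether a lookup is by position or by label. The sketch below uses the same values as the cell above under a new name, `labeled`, so it does not overwrite `series`. Recent pandas releases also appear to discourage plain `series[0]` for positional access on a non-integer index, so `.iloc` is the safer habit.

```python
import pandas as pd

labeled = pd.Series([10, 20, 30, 40, 50], index=['A', 'B', 'C', 'D', 'E'])

# .iloc selects by position; the end of a slice is excluded
print(labeled.iloc[0])        # 10
print(labeled.iloc[1:3])      # elements at positions 1 and 2

# .loc selects by label; the end of a slice is included
print(labeled.loc['B'])       # 20
print(labeled.loc['B':'D'])   # labels B, C and D
```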


<h2>Performing basic operations on Series (e.g., arithmetic operations)</h2>
You can perform arithmetic operations on Series, such as addition, subtraction, multiplication, and division. The operations are applied element-wise, based on the index alignment.

In [None]:
series1 = pd.Series([1, 2, 3])
series2 = pd.Series([10, 20, 30])

# Addition
result = series1 + series2
print(result)

# Subtraction
result = series2 - series1
print(result)

# Multiplication
result = series1 * series2
print(result)

# Division
result = series2 / series1
print(result)


0    11
1    22
2    33
dtype: int64
0     9
1    18
2    27
dtype: int64
0    10
1    40
2    90
dtype: int64
0    10.0
1    10.0
2    10.0
dtype: float64
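
Index alignment matters most when the two Series do not share the same labels. The following sketch uses two hypothetical Series, `left` and `right`, with partly overlapping indexes to show what happens.

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are matched by label; a label present on only one side yields NaN
total = left + right
print(total)

# add() with fill_value treats the missing side as 0 instead of producing NaN
total_filled = left.add(right, fill_value=0)
print(total_filled)
```

The result index is the union of both indexes, and the result is converted to float because NaN cannot be stored in an integer Series.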


<h2>Handling missing data in Series</h2>
Pandas provides various methods to handle missing or NaN (Not a Number) data in a Series. Some commonly used methods include <b>notnull()</b>, <b>isnull()</b>, and <b>fillna()</b>.

In [None]:
series = pd.Series([10, 20, None, 40, None])

# Checking for missing values
print(series.isnull())

# Checking for non-missing values
print(series.notnull())

# Filling missing values with a specific value
filled_series = series.fillna(0)
print(filled_series)


0    False
1    False
2     True
3    False
4     True
dtype: bool
0     True
1     True
2    False
3     True
4    False
dtype: bool
0    10.0
1    20.0
2     0.0
3    40.0
4     0.0
dtype: float64
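
Two further options are worth a quick look: removing the missing entries with `dropna()`, and filling them with a statistic of the observed values instead of a constant. This is a small sketch on the same data, stored under the illustrative name `readings`.

```python
import pandas as pd

readings = pd.Series([10, 20, None, 40, None])

# Count the missing entries
print(readings.isnull().sum())   # 2

# Drop the missing entries; the remaining elements keep their original labels
observed = readings.dropna()
print(observed)

# Fill the missing entries with the mean of the observed values
mean_filled = readings.fillna(readings.mean())
print(mean_filled)
```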


<h2>Applying functions and methods to Series</h2>
Pandas provides a wide range of built-in functions and methods that can be applied to Series. These include mathematical functions (<b>sum(), mean(), min(), max()</b>), statistical functions (<b>describe(), std(), count()</b>), and many more.

In [None]:
series = pd.Series([1, 2, 3, 4, 5])

# Applying mathematical functions
print("sum: ", series.sum())
print("mean: ", series.mean())
print("min: ", series.min())
print("max: ", series.max())
print()
# Applying statistical functions
print("Describe: \n", series.describe())
print()
print("std: ", series.std())
print("count: ", series.count())


sum:  15
mean:  3.0
min:  1
max:  5

Describe: 
 count    5.000000
mean     3.000000
std      1.581139
min      1.000000
25%      2.000000
50%      3.000000
75%      4.000000
max      5.000000
dtype: float64

std:  1.5811388300841898
count:  5
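
Besides the built-in methods, a custom function can be applied to every element with `apply()`. The sketch below is a minimal example; the name `values` and the squaring function are illustrative choices.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# apply() calls the function once per element
squared = values.apply(lambda x: x ** 2)
print(squared.tolist())

# Vectorized arithmetic gives the same result and is usually faster
squared_vectorized = values ** 2
print(squared_vectorized.tolist())
```

For simple arithmetic the vectorized form is generally preferable; `apply()` is mainly useful when the logic cannot be expressed with built-in operations.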


3. <h1>Introduction to DataFrames</h1>

<h2>Creating a DataFrame:</h2>
To create a DataFrame in Pandas, you can use the <b>pd.DataFrame()</b> function. There are several ways to create a DataFrame:<br>
a. <b>From a dictionary</b>: Each key-value pair represents a column, and the values can be lists, arrays, or Series.<br>
b. <b>From a 2D array or a nested list</b>: Each nested list represents a row in the DataFrame.<br>
c. <b>From a CSV, Excel file, or other data sources.</b>

In [None]:
import pandas as pd

data = {
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris']
}

df = pd.DataFrame(data)
print(df)


    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
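
Option (b), building a DataFrame from a nested list, might look like the sketch below. Each inner list is one row, and the column names are passed separately. The variable name `people` is illustrative, and the data mirrors the dictionary example.

```python
import pandas as pd

rows = [
    ['John', 25, 'New York'],
    ['Alice', 30, 'London'],
    ['Bob', 35, 'Paris'],
]

# Each inner list becomes a row; columns= supplies the column labels
people = pd.DataFrame(rows, columns=['Name', 'Age', 'City'])
print(people)
print(people.dtypes)
```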


<h2>Loading data into a DataFrame</h2>
Pandas provides functions to load data from various file formats, including CSV, Excel, JSON, SQL databases, and more. Here's an example of loading data from a CSV file:

In [None]:
import pandas as pd

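# 'data.csv' is a placeholder path; point it at a CSV file that exists on your machine.
# The result is stored in csv_df so the df built earlier stays available for the next cells.
csv_df = pd.read_csv('data.csv')
print(csv_df)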
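
To try `read_csv()` without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is a sketch that assumes a small comma-separated table; `csv_text` and `loaded` are illustrative names.

```python
import io
import pandas as pd

# CSV content held in a string instead of a file (illustrative data)
csv_text = "Name,Age,City\nJohn,25,New York\nAlice,30,London\nBob,35,Paris\n"

# read_csv accepts any file-like object, so StringIO stands in for a real file
loaded = pd.read_csv(io.StringIO(csv_text))
print(loaded.head())
print(loaded.dtypes)
```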


<h2>Basic DataFrame operations: indexing, slicing, and subsetting:</h2>
DataFrames allow for various operations to access and manipulate data. You can perform indexing, slicing, and subsetting to extract specific rows or columns from the DataFrame.

In [None]:
# Indexing columns
age_column = df['Age']  # Accessing a specific column
print(age_column)
print("---------------------------------------------------------")
# Slicing rows
subset = df[1:3]  # Selecting rows 1 and 2
print(subset)
print("---------------------------------------------------------")
# Subsetting based on conditions
subset = df[df['Age'] > 30]  # Selecting rows where Age is greater than 30
print(subset)


0    25
1    30
2    35
Name: Age, dtype: int64
---------------------------------------------------------
    Name  Age    City
1  Alice   30  London
2    Bob   35   Paris
---------------------------------------------------------
  Name  Age   City
2  Bob   35  Paris


<h2>Descriptive statistics for DataFrames</h2>
Pandas provides several methods to compute descriptive statistics for DataFrames, such as <b>describe(), mean(), min(), max(), count(), std(),</b> and more. These methods offer insights into the central tendency, dispersion, and other summary statistics of the data.

In [None]:
# Summary statistics
summary = df.describe()
print(summary)
print("---------------------------------------------------------")
# Computing mean of a specific column
age_mean = df['Age'].mean()
print(age_mean)


        Age
count   3.0
mean   30.0
std     5.0
min    25.0
25%    27.5
50%    30.0
75%    32.5
max    35.0
---------------------------------------------------------
30.0


<h2>Handling missing data in DataFrames</h2>
Missing data is a common occurrence in real-world datasets. Pandas offers methods to handle missing values, such as <b>isnull(), notnull(), fillna(),</b> and <b>dropna()</b>.

In [None]:
# Checking for missing values
print(df.isnull())
print("---------------------------------------------------------")
# Checking for non-missing values
print(df.notnull())
print("---------------------------------------------------------")
# Filling missing values with a specific value
filled_df = df.fillna(0)
print(filled_df)
print("---------------------------------------------------------")
# Dropping rows with missing values
cleaned_df = df.dropna()
print(cleaned_df)


    Name    Age   City
0  False  False  False
1  False  False  False
2  False  False  False
---------------------------------------------------------
   Name   Age  City
0  True  True  True
1  True  True  True
2  True  True  True
---------------------------------------------------------
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
---------------------------------------------------------
    Name  Age      City
0   John   25  New York
1  Alice   30    London
2    Bob   35     Paris
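
The example DataFrame has no missing values, so the calls above return it unchanged. The sketch below uses a hypothetical DataFrame, `survey`, with gaps in two columns to show what these methods actually do.

```python
import numpy as np
import pandas as pd

survey = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, np.nan, 35],
    'City': ['New York', 'London', None],
})

# Number of missing values per column
print(survey.isnull().sum())

# fillna() accepts a dict to use a different replacement per column
filled = survey.fillna({'Age': survey['Age'].mean(), 'City': 'Unknown'})
print(filled)

# dropna() removes every row that has at least one missing value
complete_rows = survey.dropna()
print(complete_rows)
```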


4. <h1>Data Manipulation with DataFrames</h1>

<h2>Selecting and filtering data from DataFrames</h2>
Pandas provides several techniques to select and filter data from DataFrames based on specific conditions. You can use methods like <b>loc[]</b> and <b>iloc[]</b> for label-based and integer-based indexing, respectively. You can also apply boolean indexing to filter rows based on certain criteria.

In [None]:
# Selecting columns by name
selected_columns = df[['Name', 'Age']]

# Selecting rows with loc[] and a boolean condition
subset = df.loc[df['Age'] > 30]

# Selecting rows by integer position with iloc[] (positions 1 up to, but not including, 4)
subset = df.iloc[1:4]

# Applying boolean indexing directly
subset = df[df['City'] == 'New York']
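
Both `loc[]` and `iloc[]` also accept a second argument that selects columns. The sketch below rebuilds the small example table under the illustrative name `people` so that it runs on its own.

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 30, 35],
    'City': ['New York', 'London', 'Paris'],
})

# loc[]: a boolean mask for the rows, labels for the columns
older_names = people.loc[people['Age'] > 26, ['Name', 'City']]
print(older_names)

# iloc[]: integer positions for rows and columns (slice ends are excluded)
first_two = people.iloc[0:2, 0:2]
print(first_two)

# Conditions are combined with & (and) or | (or); each one needs its own parentheses
match = people[(people['Age'] >= 30) & (people['City'] != 'Paris')]
print(match)
```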


<h2>Modifying DataFrame structure: adding, deleting, and renaming columns</h2>
Pandas allows you to modify the structure of DataFrames by adding, deleting, and renaming columns. You can use assignment (=) to add or modify columns, the <b>drop()</b> method to delete columns, and the <b>rename()</b> method to rename columns.

In [None]:
# Adding a new column
df['Gender'] = ['M', 'F', 'M']

# Modifying an existing column
df['Age'] = df['Age'] + 1

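# Deleting a column (drop() returns a new DataFrame; df itself keeps 'City')
df_without_city = df.drop('City', axis=1)

# Renaming columns (also returns a new DataFrame, so later cells can still use 'Name' and 'Age')
renamed_df = df.rename(columns={'Name': 'Full Name', 'Age': 'Age (in years)'})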


<h2>Sorting and ranking data</h2>
You can sort DataFrames based on one or more columns using the <b>sort_values()</b> method. Additionally, you can rank the data using the <b>rank()</b> method, which assigns ranks to the values within each column.

In [None]:
# Sorting DataFrame by column(s)
sorted_df = df.sort_values(by='Age', ascending=False)

# Sorting DataFrame by multiple columns
sorted_df = df.sort_values(by=['City', 'Age'])

# Ranking DataFrame values within each column
ranked_df = df.rank()
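
A worked example makes the sort order and the treatment of ties easier to see. This sketch uses a hypothetical `scores` table; the `ascending` list gives one direction per sort column.

```python
import pandas as pd

scores = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Eve'],
    'City': ['Paris', 'London', 'Paris', 'London'],
    'Score': [70, 85, 85, 60],
})

# Sort by City (A to Z), then by Score (highest first) within each city
ordered = scores.sort_values(by=['City', 'Score'], ascending=[True, False])
print(ordered)

# rank(): tied values share the average of their ranks by default
scores['Rank'] = scores['Score'].rank(ascending=False)
print(scores)
```

Here the two scores of 85 tie for first and second place, so both receive rank 1.5.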


<h2>Handling duplicates in DataFrames</h2>
Pandas provides methods to identify and handle duplicate rows in DataFrames. The <b>duplicated()</b> method detects duplicate rows, and the <b>drop_duplicates()</b> method removes duplicates based on specific columns or the entire row.

In [None]:
# Identifying duplicate rows
duplicates = df.duplicated()

# Removing duplicate rows
df = df.drop_duplicates()

# Removing duplicate rows based on specific columns
df = df.drop_duplicates(subset=['Name', 'Age'])
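
The difference between whole-row and subset-based deduplication shows up on a small table. The sketch below uses a hypothetical `records` DataFrame and the `keep` parameter of `drop_duplicates()`.

```python
import pandas as pd

records = pd.DataFrame({
    'Name': ['John', 'Alice', 'John', 'Alice'],
    'Age': [25, 30, 25, 31],
})

# duplicated() marks a row True when an identical row appeared earlier
print(records.duplicated().tolist())

# Whole-row deduplication keeps the first occurrence of each distinct row
unique_rows = records.drop_duplicates()
print(unique_rows)

# Deduplicate on 'Name' only, keeping the last occurrence of each name
last_per_name = records.drop_duplicates(subset=['Name'], keep='last')
print(last_per_name)
```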


<h2>Merging, joining, and concatenating DataFrames</h2>
Pandas offers various methods to combine DataFrames based on common columns or indexes. You can use <b>merge()</b> for database-style joins, <b>join()</b> for combining DataFrames on indexes, and <b>concat()</b> for concatenating DataFrames vertically or horizontally.

In [None]:
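# Two small example DataFrames that share an 'ID' column (illustrative data)
df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['John', 'Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Score': [85, 92, 78]})

# Merging DataFrames based on common columns (inner join by default)
merged_df = pd.merge(df1, df2, on='ID')

# Joining DataFrames based on indexes ('ID' is set as the index on both sides first)
joined_df = df1.set_index('ID').join(df2.set_index('ID'))

# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1, df2])

# Concatenating DataFrames horizontally
concatenated_df = pd.concat([df1, df2], axis=1)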
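
The `how` parameter of `merge()` controls which keys survive the join. The sketch below compares three join types on two hypothetical tables, `customers` and `orders`, and shows `ignore_index` for vertical concatenation.

```python
import pandas as pd

customers = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['John', 'Alice', 'Bob']})
orders = pd.DataFrame({'ID': [1, 2, 4], 'Amount': [250, 120, 75]})

# Inner join (default): only IDs present in both tables
inner = pd.merge(customers, orders, on='ID')

# Left join: every customer is kept; Amount is NaN where no order matches
left = pd.merge(customers, orders, on='ID', how='left')

# Outer join: every ID from either table is kept
outer = pd.merge(customers, orders, on='ID', how='outer')

# ignore_index=True renumbers the rows after stacking
stacked = pd.concat([customers, customers], ignore_index=True)

print(inner, left, outer, stacked, sep='\n\n')
```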


5. <h1>Data Cleaning and Preprocessing</h1>

<h2>Dealing with missing data: imputation and dropping missing values</h2>
Missing data is a common issue in datasets. Pandas provides methods to handle missing values, such as imputation (filling missing values with estimated values) and dropping missing values.

In [None]:
# 'Column' is a placeholder; replace it with a column name from your DataFrame.

# Filling missing values with a specific value (0 here)
df['Column'] = df['Column'].fillna(0)

# Filling missing values with the column mean (use .median() or .mode()[0] for the alternatives)
df['Column'] = df['Column'].fillna(df['Column'].mean())

# Dropping rows with missing values
df = df.dropna()
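
Median and mode imputation follow the same pattern as the mean. The sketch below assumes a hypothetical `measurements` table with one numeric and one categorical column.

```python
import numpy as np
import pandas as pd

measurements = pd.DataFrame({
    'Height': [170.0, np.nan, 165.0, 180.0],
    'Color': ['red', 'blue', None, 'red'],
})

# Numeric column: fill with the median of the observed values
measurements['Height'] = measurements['Height'].fillna(measurements['Height'].median())

# Categorical column: mode() returns a Series, so take its first entry
measurements['Color'] = measurements['Color'].fillna(measurements['Color'].mode()[0])

print(measurements)
```

The median is often preferred over the mean when the column contains extreme values, since it is less affected by them.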


<h2>Removing outliers from data</h2>
Outliers are extreme values that deviate significantly from the rest of the data. They can affect the accuracy and reliability of analysis. Pandas has no single built-in outlier function, but common rules such as z-scores or percentile-based (IQR) bounds are easy to express with Series methods like <b>mean()</b>, <b>std()</b>, and <b>quantile()</b>.

In [None]:
# Identifying outliers using z-score (the absolute value catches unusually low values as well as high ones)
z_scores = (df['Column'] - df['Column'].mean()) / df['Column'].std()
outliers = df[z_scores.abs() > 3]

# Removing outliers based on z-score
df = df[z_scores.abs() <= 3]

# Identifying outliers using percentiles
q1 = df['Column'].quantile(0.25)
q3 = df['Column'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = df[(df['Column'] < lower_bound) | (df['Column'] > upper_bound)]

# Removing outliers based on percentiles
df = df[(df['Column'] >= lower_bound) & (df['Column'] <= upper_bound)]
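
The two rules can disagree on small samples. The sketch below runs both on a hypothetical `sample` column with one obviously extreme value; the data and names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({'Value': [10, 12, 11, 13, 12, 11, 95]})

# IQR rule: flag values more than 1.5 * IQR outside the quartiles
q1 = sample['Value'].quantile(0.25)
q3 = sample['Value'].quantile(0.75)
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# between() includes both bounds
in_range = sample['Value'].between(lower_bound, upper_bound)
outliers = sample[~in_range]
cleaned = sample[in_range]
print(outliers)

# z-score rule on the same data
z_scores = (sample['Value'] - sample['Value'].mean()) / sample['Value'].std()
print(z_scores.abs().max())
```

In this tiny sample the IQR rule flags 95, while its z-score stays below 3 because the extreme value inflates the standard deviation. A threshold of 3 is therefore better suited to larger datasets.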


<h2>Data type conversion and handling categorical data</h2>
Pandas provides methods to convert data types and handle categorical variables. You can use the <b>astype()</b> method to convert data types, and the <b>pd.Categorical()</b> or <b>pd.get_dummies()</b> functions to handle categorical data.

In [None]:
# Converting data type of a column
df['Column'] = df['Column'].astype(int)

# Converting a column to categorical type
df['Category'] = pd.Categorical(df['Category'])

# Creating dummy variables for categorical data
dummy_df = pd.get_dummies(df['Category'])
df = pd.concat([df, dummy_df], axis=1)
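
Categorical data can also carry an explicit order, which affects sorting and the integer codes used internally. The sketch below assumes a hypothetical `shirts` table with a `Size` column.

```python
import pandas as pd

shirts = pd.DataFrame({'Size': ['M', 'S', 'L', 'M']})

# Ordered categorical: the categories list defines the order S < M < L
shirts['Size'] = pd.Categorical(shirts['Size'], categories=['S', 'M', 'L'], ordered=True)
print(shirts['Size'].cat.codes.tolist())   # [1, 0, 2, 1]

# Sorting follows the category order, not alphabetical order
print(shirts.sort_values('Size')['Size'].tolist())

# One-hot encoding; prefix= makes the new column names self-describing
dummies = pd.get_dummies(shirts['Size'], prefix='Size')
print(dummies.columns.tolist())
```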


<h2>String operations and text processing</h2>
Pandas provides several string methods that can be applied to string columns for text processing and cleaning operations. These methods include <b>str.lower()</b>, <b>str.upper()</b>, <b>str.replace()</b>, <b>str.split()</b>, and more.

In [None]:
# Converting strings to lowercase
df['Column'] = df['Column'].str.lower()

# Replacing substrings in a column
df['Column'] = df['Column'].str.replace('old', 'new')

# Splitting strings into two columns at the first space (n=1 guarantees exactly two parts)
df[['First Name', 'Last Name']] = df['Full Name'].str.split(' ', n=1, expand=True)
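
String methods can be chained, which is convenient for cleaning messy text. The sketch below assumes a hypothetical `contacts` table with inconsistent spacing and capitalization.

```python
import pandas as pd

contacts = pd.DataFrame({'Full Name': ['  John Smith', 'ALICE Jones ', 'bob Lee']})

# Remove surrounding whitespace, then capitalize each word
contacts['Full Name'] = contacts['Full Name'].str.strip().str.title()

# Split at the first space only, so each name yields exactly two parts
contacts[['First Name', 'Last Name']] = contacts['Full Name'].str.split(' ', n=1, expand=True)

# str.contains() returns a boolean Series that can be used as a filter
has_j = contacts['Full Name'].str.contains('J')

print(contacts)
print(has_j.tolist())
```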
