#### Data Manipulation and Analysis with Pandas
Data manipulation and analysis are core tasks in any data science project. Pandas provides functions for cleaning, transforming, and summarizing data. This lesson covers inspecting a DataFrame, handling missing values, renaming columns, changing data types, grouping and aggregating, and merging DataFrames.

In [1]:
import pandas as pd

In [None]:
df = pd.read_csv('data.csv')
## fetch the first 5 rows
df.head(5)

In [None]:
df.tail(5)

In [None]:
df.dtypes

In [None]:
df.describe()
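
The contents of `data.csv` are not shown in this lesson, so the inspection calls above can be tried on a small stand-in. The sketch below builds one in memory with `io.StringIO`; the column names (`Date`, `Product`, `Region`, `Value`, `Sales`) match the ones used later in the lesson, but the rows themselves are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv: the rows are made up for illustration.
csv_text = """Date,Product,Region,Value,Sales
2023-01-01,A,North,10.0,100.0
2023-01-02,B,South,,200.0
2023-01-03,A,East,30.0,
2023-01-04,C,North,40.0,400.0
"""

# read_csv accepts any file-like object, so no file on disk is needed.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.head(2))     # first two rows
print(sample.tail(2))     # last two rows
print(sample.dtypes)      # empty fields become NaN, so Value and Sales are floats
print(sample.describe())  # summary statistics for the numeric columns only
```

Note that `describe()` reports a `count` of 3 for both `Value` and `Sales` here, because missing entries are excluded from the statistics.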

In [None]:
## Handling Missing Values
df.isnull()

In [None]:
# Check which columns contain missing values
df.isnull().any()

In [None]:
# Count the missing values in each column
df.isnull().sum()
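
The three calls above answer different questions: `isnull()` gives a True/False mask per cell, `.any()` reduces it to one flag per column, and `.sum()` counts the missing cells per column. A minimal sketch on a made-up frame (the `demo` data is an assumption, not the lesson's dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data only; not the lesson's data.csv.
demo = pd.DataFrame({
    "Product": ["A", "B", "C"],
    "Value": [10.0, np.nan, 30.0],
    "Sales": [np.nan, np.nan, 300.0],
})

has_missing = demo.isnull().any()          # one boolean per column
missing_counts = demo.isnull().sum()       # True counts as 1, so this counts NaNs
total_missing = int(missing_counts.sum())  # missing cells in the whole frame

print(has_missing)
print(missing_counts)
print("total missing cells:", total_missing)
```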

In [9]:
# Fill missing values with zeros
df_filled = df.fillna(0)

In [None]:
df_filled

In [None]:
### Filling missing values with the mean of the column
# Fill missing values in the "Sales" column with that column's mean and store the result in a new column, "Sales_fillNA"
df['Sales_fillNA'] = df['Sales'].fillna(df['Sales'].mean())
df
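
Filling with zero and filling with the mean affect later statistics differently. The sketch below compares the two on a small made-up `Sales` series (the numbers are assumptions chosen to keep the arithmetic easy to follow):

```python
import numpy as np
import pandas as pd

# Illustrative series with one missing entry.
sales = pd.Series([100.0, np.nan, 300.0], name="Sales")

# mean() skips NaN by default, so this is (100 + 300) / 2.
observed_mean = sales.mean()

zero_filled = sales.fillna(0)
mean_filled = sales.fillna(observed_mean)

print("mean of observed values:", observed_mean)
print("mean after zero fill:   ", round(zero_filled.mean(), 2))  # pulled down by the 0
print("mean after mean fill:   ", mean_filled.mean())            # unchanged
```

Mean filling keeps the column average intact, while zero filling treats a missing sale as a sale of zero; which is appropriate depends on what a missing entry means in the data.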

In [None]:
df.dtypes

In [None]:
# List the column names
df.columns

In [None]:
## Renaming Columns
# Rename the column "Date" to "Sales Date"
df = df.rename(columns={'Date': 'Sales Date'})
df.head()
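
One detail worth knowing about `rename`: by default it ignores mapping keys that do not match any column, so a misspelled column name produces no error and no change. A small sketch, assuming a made-up two-column frame, shows the default behavior and the stricter `errors="raise"` option:

```python
import pandas as pd

# Illustrative frame; only the column names matter here.
demo = pd.DataFrame({"Date": ["2023-01-01"], "Value": [10]})

# Matching key: the column is renamed and a new DataFrame is returned.
renamed = demo.rename(columns={"Date": "Sales Date"})

# Non-matching key: silently ignored by default, columns stay as they were.
unchanged = demo.rename(columns={"Sale Date": "Sales Date"})

# errors="raise" turns the silent no-op into a KeyError.
raised = False
try:
    demo.rename(columns={"Sale Date": "Sales Date"}, errors="raise")
except KeyError:
    raised = True

print(list(renamed.columns))
print(list(unchanged.columns))
print("KeyError raised for missing label:", raised)
```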

In [None]:
## Change data types
# Fill missing values first: a column that still contains NaN cannot be cast to int
df['Value_new'] = df['Value'].fillna(df['Value'].mean()).astype(int)
df.head()
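
The order of operations in the cell above matters, and so does the way `astype(int)` handles fractions. The sketch below, on a made-up series, shows that casting fails while NaN is present and that the cast truncates rather than rounds:

```python
import numpy as np
import pandas as pd

# Illustrative values, including one missing entry.
values = pd.Series([1.9, np.nan, 4.0])

# Casting directly fails because NaN has no integer representation.
cast_failed = False
try:
    values.astype(int)
except ValueError:
    cast_failed = True

# Fill first (mean of 1.9 and 4.0 is about 2.95), then cast.
filled = values.fillna(values.mean())
as_int = filled.astype(int)           # drops the fractional part
rounded = filled.round().astype(int)  # rounds before casting

print("direct cast failed:", cast_failed)
print(as_int.tolist())
print(rounded.tolist())
```

If rounding to the nearest integer is what is wanted, calling `.round()` before `.astype(int)` is one option.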

In [None]:
# Double every entry in 'Value' and store the result in a new column
df['New Value'] = df['Value'].apply(lambda x: x * 2)
df.head()
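
`apply` with a lambda calls the function once per element. For simple arithmetic like doubling, the same result can be obtained with a vectorized expression on the whole column, which is usually shorter and faster. A minimal comparison on made-up data:

```python
import numpy as np
import pandas as pd

# Illustrative column with one missing entry.
demo = pd.DataFrame({"Value": [1.0, 2.5, np.nan]})

via_apply = demo["Value"].apply(lambda x: x * 2)  # element by element
vectorized = demo["Value"] * 2                    # whole column at once

# equals() treats NaN in the same position as equal.
print(via_apply.equals(vectorized))
```

`apply` remains useful when the per-element logic cannot be expressed with built-in column operations.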

In [None]:
## Data Aggregation and Grouping
df.head()

In [None]:
df.dtypes

In [None]:
grouped_mean=df.groupby('Product')['Value'].mean()
print(grouped_mean)

'''
The code groups a DataFrame by 'Product' and computes the mean of the 'Value' column for each group, returning a new Series with product names as the index and average values as the values.

Grouping the Data:

df.groupby('Product') tells pandas to split your DataFrame (df) into groups based on the unique values in the 'Product' column.
For example, if your DataFrame has products "A", "B", and "C", this step creates three groups, one for each product.

Selecting the 'Value' Column:

After grouping, ['Value'] selects only the 'Value' column from each group.
This means only the numerical values in the 'Value' column are used in the calculation.

Calculating the Mean:

.mean() computes the average of the 'Value' column within each product group.
For each product group, it adds up the non-missing values in the 'Value' column and divides by the number of non-missing entries in that group.

Storing the Result:

The result is stored in the variable grouped_mean. This is a Series where:
The index is the unique product names.
The values are the calculated averages for each product.

Printing the Result:

print(grouped_mean) displays the average 'Value' for each product group.

'''
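
The split, select, and average steps described above can be checked by hand on a small frame. The sketch below uses made-up data and compares the grouped result with a manual filter-then-mean for one product:

```python
import numpy as np
import pandas as pd

# Illustrative data; product B has one missing value.
demo = pd.DataFrame({
    "Product": ["A", "A", "B", "B", "B"],
    "Value": [10.0, 20.0, 5.0, np.nan, 15.0],
})

grouped_mean = demo.groupby("Product")["Value"].mean()

# The same number for product B, computed without groupby.
manual_mean_b = demo.loc[demo["Product"] == "B", "Value"].mean()

print(grouped_mean)
print("manual mean for B:", manual_mean_b)
```

Product B averages to 10.0 rather than 6.67 because the missing value is skipped, not counted as zero.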

In [None]:
grouped_sum=df.groupby(['Product','Region'])['Value'].sum()
print(grouped_sum)

In [None]:
df.groupby(['Product','Region'])['Value'].mean()
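
Grouping by two columns returns a Series indexed by a `MultiIndex` of (Product, Region) pairs. The sketch below, on made-up data, shows three common ways to work with that result: looking up one pair, flattening back to ordinary columns, and pivoting one level into columns.

```python
import pandas as pd

# Illustrative data only.
demo = pd.DataFrame({
    "Product": ["A", "A", "B", "B"],
    "Region": ["North", "South", "North", "North"],
    "Value": [10, 20, 30, 40],
})

grouped_sum = demo.groupby(["Product", "Region"])["Value"].sum()

# Look up a single (Product, Region) combination with a tuple.
b_north = grouped_sum.loc[("B", "North")]

# Turn the index levels back into regular columns.
flat = grouped_sum.reset_index()

# Move 'Region' from the index into columns; absent pairs become NaN.
wide = grouped_sum.unstack("Region")

print(grouped_sum)
print(flat)
print(wide)
```

Only combinations that actually occur in the data appear in `grouped_sum`, which is why (B, South) shows up as NaN after `unstack`.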

In [None]:
## Aggregate with multiple functions
grouped_agg = df.groupby('Region')['Value'].agg(['mean', 'sum', 'count'])
grouped_agg

'''

Grouping the DataFrame by Region:

df.groupby('Region') tells pandas to split your DataFrame into groups based on the unique values in the 'Region' column.
For example, if your DataFrame contains regions like "North", "South", "East", and "West", this command creates a group for each region.

Selecting the 'Value' Column:

['Value'] specifies that we only want to work with the 'Value' column from each group.
This column is assumed to be numerical, which is important because we will be calculating statistics on it.

Applying Multiple Aggregation Functions:

The .agg(['mean', 'sum', 'count']) method performs three different operations on the 'Value' column for each group:
mean: Calculates the average value.
sum: Adds up all the values.
count: Counts the non-missing entries in each group (rows where 'Value' is NaN are not counted).
These functions are applied in a single call, and the result is a new DataFrame that includes all three statistics for each region.

Storing and Displaying the Result:

The result is stored in the variable grouped_agg.
When you run grouped_agg (or print it), you will see a table where:
The index consists of the unique regions.
The columns are mean, sum, and count, each showing the corresponding calculation for the 'Value' column in that region.
'''
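
Two details of multi-function aggregation are easy to miss: `count` ignores missing values, and the output columns can be given custom names. The sketch below uses made-up data; the column names `avg_value`, `total_value`, and `non_null` are hypothetical labels chosen for this example.

```python
import numpy as np
import pandas as pd

# Illustrative data; North has one missing value.
demo = pd.DataFrame({
    "Region": ["North", "North", "South", "South"],
    "Value": [10.0, np.nan, 30.0, 50.0],
})

# Named aggregation: keyword = output column name, value = function name.
stats = demo.groupby("Region")["Value"].agg(
    avg_value="mean",
    total_value="sum",
    non_null="count",
)

# size() counts rows per group, including rows where 'Value' is NaN.
rows_per_region = demo.groupby("Region").size()

print(stats)
print(rows_per_region)
```

For North, `non_null` is 1 while `size()` reports 2 rows, which is the difference between counting values and counting rows.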

In [24]:
### Merging and Joining DataFrames
# Create sample DataFrames
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

In [None]:
df1

In [None]:
df2

In [None]:
## Merge the DataFrames on the 'Key' column
pd.merge(df1,df2,on="Key",how="inner")

In [None]:
pd.merge(df1,df2,on="Key",how="outer")

In [None]:
pd.merge(df1,df2,on="Key",how="left")

In [None]:
pd.merge(df1,df2,on="Key",how="right")
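
The four `how` options differ only in which keys survive the merge. The sketch below rebuilds the same `df1` and `df2` as above, collects the resulting keys for each option, and uses `indicator=True` on the outer merge to show which side each row came from.

```python
import pandas as pd

# Same sample frames as in the cells above.
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'], 'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'], 'Value2': [4, 5, 6]})

# Keys kept by each join type.
keys_by_how = {}
for how in ("inner", "outer", "left", "right"):
    merged = pd.merge(df1, df2, on="Key", how=how)
    keys_by_how[how] = sorted(merged["Key"])
    print(how, keys_by_how[how])

# indicator=True adds a '_merge' column: 'both', 'left_only', or 'right_only'.
outer = pd.merge(df1, df2, on="Key", how="outer", indicator=True)
print(outer)
```

Keys that exist on only one side ('C' in `df1`, 'D' in `df2`) get NaN in the columns that come from the other frame.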