## Data Manipulation and Analysis with Pandas

Data manipulation and analysis are core tasks in any data science or data analysis project. Pandas provides a wide range of functions for both, making it easier to clean, transform, and extract insights from data.

--- 

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('dataset.csv')
df.head(5)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0


In [3]:
df.describe()

Unnamed: 0,Rank,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore
count,1000.0,1000.0,1000.0,1000.0,1000.0,872.0,936.0
mean,500.5,2012.783,113.172,6.7232,169808.3,82.956376,58.985043
std,288.819436,3.205962,18.810908,0.945429,188762.6,103.25354,17.194757
min,1.0,2006.0,66.0,1.9,61.0,0.0,11.0
25%,250.75,2010.0,100.0,6.2,36309.0,13.27,47.0
50%,500.5,2014.0,111.0,6.8,110799.0,47.985,59.5
75%,750.25,2016.0,123.0,7.4,239909.8,113.715,72.0
max,1000.0,2016.0,191.0,9.0,1791916.0,936.63,100.0


In [4]:
df.dtypes

Rank                    int64
Title                  object
Genre                  object
Description            object
Director               object
Actors                 object
Year                    int64
Runtime (Minutes)       int64
Rating                float64
Votes                   int64
Revenue (Millions)    float64
Metascore             float64
dtype: object

### Handling Missing Values Using 'isna()' (alias 'isnull()'):

In [5]:
df.isnull().sum()

Rank                    0
Title                   0
Genre                   0
Description             0
Director                0
Actors                  0
Year                    0
Runtime (Minutes)       0
Rating                  0
Votes                   0
Revenue (Millions)    128
Metascore              64
dtype: int64

In [6]:
df.isnull().any()

Rank                  False
Title                 False
Genre                 False
Description           False
Director              False
Actors                False
Year                  False
Runtime (Minutes)     False
Rating                False
Votes                 False
Revenue (Millions)     True
Metascore              True
dtype: bool
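
`isnull()` is an alias of `isna()`, so either spelling gives the same result. As a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are illustrative assumptions, not the movie dataset), the share of missing values per column can be read from the mean of the boolean mask:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration.
toy = pd.DataFrame({
    "Rating": [8.1, 7.0, 7.3, 7.2],
    "Revenue": [333.13, np.nan, 138.12, np.nan],
})

# isnull() is an alias of isna(): both return the same boolean mask.
same_mask = toy.isna().equals(toy.isnull())

# The mean of a boolean mask is the fraction of True values per column.
missing_share = toy.isna().mean()

print(same_mask)
print(missing_share)
```

A share is often easier to act on than a raw count, for example when deciding whether a column is too sparse to impute.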

---

```python
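df.fillna(value)          # replace NA with a constant (scalar, dict, or Series)
df.ffill(); df.bfill()    # propagate neighbours; fillna(method=...) is deprecated since pandas 2.1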
```

ffill (forward fill)
- Propagates the last non-NA value forward to fill subsequent NA values along the chosen axis.
- If there is no prior non-NA value, the NA stays.
- Useful for carrying last observation forward in time series.

bfill (backward fill)
- Propagates the next non-NA value backward to fill prior NA values.
- If there is no following non-NA value, the NA stays.

Options
- limit=int — max consecutive NAs to fill.
- axis=0 (default) fills down rows; axis=1 fills across columns.
- inplace=True to modify the object in place.

Examples:


In [7]:
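import pandas as pd

s = pd.Series([1, None, None, 2, None])

# Forward fill: carry the last valid value forward.
# Series.ffill() replaces the deprecated fillna(method='ffill').
s.ffill()
# -> 0    1.0
#    1    1.0
#    2    1.0
#    3    2.0
#    4    2.0

# Backward fill: pull the next valid value backward.
# The trailing NA has no later value, so it stays NaN.
s.bfill()
# -> 0    1.0
#    1    2.0
#    2    2.0
#    3    2.0
#    4    NaN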


0    1.0
1    2.0
2    2.0
3    2.0
4    NaN
dtype: float64
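
The `limit` option listed above caps how many consecutive NA values are filled. A minimal sketch, reusing the same example Series (the variable names `limited` and `both_ways` are illustrative):

```python
import pandas as pd

s = pd.Series([1, None, None, 2, None])

# Fill at most one consecutive NA after each valid value.
# Index 2 is the second NA in its run, so it stays NaN.
limited = s.ffill(limit=1)

# Chaining ffill and bfill covers both leading and trailing gaps.
both_ways = s.ffill().bfill()

print(limited)
print(both_ways)
```

Capping the fill is useful when a long gap should stay visible rather than be papered over with a stale value.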

---

In [8]:
df_filled = df.fillna(0)
print(df_filled.isna().sum())

Rank                  0
Title                 0
Genre                 0
Description           0
Director              0
Actors                0
Year                  0
Runtime (Minutes)     0
Rating                0
Votes                 0
Revenue (Millions)    0
Metascore             0
dtype: int64


### Filling missing values with mean of the column :


In [9]:
df['Revenue (Millions)_filled'] = df['Revenue (Millions)'].fillna(df['Revenue (Millions)'].mean())

print(df['Revenue (Millions)'].isna().sum())
print(df['Revenue (Millions)_filled'].isna().sum())

128
0
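
Filling with the mean leaves the column mean unchanged but shrinks its spread, because every filled row sits exactly at the mean. A minimal sketch on made-up numbers (the `rev` Series is an illustrative assumption, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical revenue values with two gaps.
rev = pd.Series([10.0, np.nan, 30.0, np.nan, 20.0])

# mean() skips NaN by default, so this is the mean of 10, 30 and 20.
col_mean = rev.mean()
filled = rev.fillna(col_mean)

print(col_mean, filled.mean())   # the mean is preserved
print(rev.std(), filled.std())   # the standard deviation shrinks
```

This is worth keeping in mind when the filled column later feeds into variance-based statistics.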


In [10]:
df[df['Revenue (Millions)'].isna()].head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Revenue (Millions)_filled
7,8,Mindhorn,Comedy,A has-been actor best known for playing the ti...,Sean Foley,"Essie Davis, Andrea Riseborough, Julian Barrat...",2016,89,6.4,2490,,71.0,82.956376
22,23,Hounds of Love,"Crime,Drama,Horror",A cold-blooded predatory couple while cruising...,Ben Young,"Emma Booth, Ashleigh Cummings, Stephen Curry,S...",2016,108,6.7,1115,,72.0,82.956376
25,26,Paris pieds nus,Comedy,Fiona visits Paris for the first time to assis...,Dominique Abel,"Fiona Gordon, Dominique Abel,Emmanuelle Riva, ...",2016,83,6.8,222,,,82.956376
39,40,5- 25- 77,"Comedy,Drama","Alienated, hopeful-filmmaker Pat Johnson's epi...",Patrick Read Johnson,"John Francis Daley, Austin Pendleton, Colleen ...",2007,113,7.1,241,,,82.956376
42,43,Don't Fuck in the Woods,Horror,A group of friends are going on a camping trip...,Shawn Burkett,"Brittany Blanton, Ayse Howard, Roman Jossart,N...",2016,73,2.7,496,,,82.956376


### Renaming columns :

In [11]:
df.rename(columns={'Revenue (Millions)_filled': 'Revenue (Millions) filled'}, inplace=True)
df.head()

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Revenue (Millions) filled
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,333.13
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,126.46
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,138.12
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,270.32
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,325.02
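
By default `rename` returns a new DataFrame and silently ignores keys that match no column. A minimal sketch of both behaviours, on a made-up one-column frame (`toy` and the new name `revenue_filled` are illustrative assumptions):

```python
import pandas as pd

toy = pd.DataFrame({"Revenue (Millions)_filled": [1.0, 2.0]})

# Without inplace=True the original frame is left untouched.
renamed = toy.rename(columns={"Revenue (Millions)_filled": "revenue_filled"})

# errors="raise" turns a misspelled column name into a KeyError
# instead of a silent no-op.
try:
    toy.rename(columns={"Revenu": "x"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(toy.columns), list(renamed.columns), typo_caught)
```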


### Changing datatypes :

In [12]:
# astype('int') drops the fractional part (8.1 -> 8, 7.3 -> 7); it does not round
df['Rating_new'] = df['Rating'].astype('int')
df.head(5)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Revenue (Millions) filled,Rating_new
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,333.13,8
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,126.46,7
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,138.12,7
3,4,Sing,"Animation,Comedy,Family","In a city of humanoid animals, a hustling thea...",Christophe Lourdelet,"Matthew McConaughey,Reese Witherspoon, Seth Ma...",2016,108,7.2,60545,270.32,59.0,270.32,7
4,5,Suicide Squad,"Action,Adventure,Fantasy",A secret government agency recruits some of th...,David Ayer,"Will Smith, Jared Leto, Margot Robbie, Viola D...",2016,123,6.2,393727,325.02,40.0,325.02,6
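
Two details of integer casting are easy to miss: the cast truncates rather than rounds, and a float column that contains NaN cannot be cast to a plain integer dtype. A minimal sketch with made-up ratings (all variable names here are illustrative):

```python
import numpy as np
import pandas as pd

# Hypothetical ratings, chosen to show truncation versus rounding.
ratings = pd.Series([8.1, 7.9, 6.2, 6.6])

truncated = ratings.astype("int")          # drops the fractional part: 8, 7, 6, 6
rounded = ratings.round().astype("int")    # rounds first: 8, 8, 6, 7

# A float column containing NaN cannot be cast to a plain NumPy integer dtype.
with_gap = pd.Series([8.1, np.nan])
try:
    with_gap.astype("int")
    cast_failed = False
except ValueError:
    cast_failed = True

# The nullable "Int64" extension dtype keeps the gap as <NA>.
nullable = with_gap.round().astype("Int64")

print(truncated.tolist(), rounded.tolist(), cast_failed)
print(nullable)
```

In the dataset above this matters for `Metascore` and `Revenue (Millions)`, which contain missing values; `Rating` has none, so the plain cast works.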


### Data Aggregation and Grouping :

In [14]:
df.head(3)

Unnamed: 0,Rank,Title,Genre,Description,Director,Actors,Year,Runtime (Minutes),Rating,Votes,Revenue (Millions),Metascore,Revenue (Millions) filled,Rating_new
0,1,Guardians of the Galaxy,"Action,Adventure,Sci-Fi",A group of intergalactic criminals are forced ...,James Gunn,"Chris Pratt, Vin Diesel, Bradley Cooper, Zoe S...",2014,121,8.1,757074,333.13,76.0,333.13,8
1,2,Prometheus,"Adventure,Mystery,Sci-Fi","Following clues to the origin of mankind, a te...",Ridley Scott,"Noomi Rapace, Logan Marshall-Green, Michael Fa...",2012,124,7.0,485820,126.46,65.0,126.46,7
2,3,Split,"Horror,Thriller",Three girls are kidnapped by a man with a diag...,M. Night Shyamalan,"James McAvoy, Anya Taylor-Joy, Haley Lu Richar...",2016,117,7.3,157606,138.12,62.0,138.12,7


In [None]:
df['Year'].value_counts()

Year
2016    297
2015    127
2014     98
2013     91
2012     64
2011     63
2010     60
2007     53
2008     52
2009     51
2006     44
Name: count, dtype: int64

In [21]:
# Mean revenue per year, computed with pandas' built-in groupby()

grouped_df = df.groupby("Year")['Revenue (Millions) filled'].mean()
grouped_df


Year
2006     86.144835
2007     87.510481
2008     98.772623
2009    110.276186
2010    103.975319
2011     87.538355
2012    107.973281
2013     86.984496
2014     84.992097
2015     78.862278
2016     63.446588
Name: Revenue (Millions) filled, dtype: float64

In [None]:
grouped_sum = df.groupby(["Year","Director"])['Revenue (Millions) filled'].sum()
grouped_sum

Year  Director                   
2006  Adam McKay                     148.210000
      Alejandro González Iñárritu     34.300000
      Alexandre Aja                   41.780000
      Alfonso Cuarón                  35.290000
      Andy Fickman                     2.340000
                                        ...    
2016  William Oldroyd                 82.956376
      Woody Allen                     11.080000
      Xavier Dolan                    82.956376
      Yimou Zhang                     45.130000
      Zack Snyder                    330.250000
Name: Revenue (Millions) filled, Length: 987, dtype: float64
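
Grouping by two columns returns a Series with a two-level index (`Year`, `Director`). A minimal sketch on a made-up frame (`toy`, its directors "A" and "B", and the `Revenue` column are illustrative assumptions) shows how to look up one group and how to flatten the result back into ordinary columns:

```python
import pandas as pd

# Hypothetical toy data; director names and revenues are made up.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016],
    "Director": ["A", "B", "A", "A"],
    "Revenue": [10.0, 20.0, 30.0, 5.0],
})

by_year_director = toy.groupby(["Year", "Director"])["Revenue"].sum()

# Select a single group with a (Year, Director) tuple.
total_2016_a = by_year_director.loc[(2016, "A")]

# reset_index() turns the index levels back into regular columns.
flat = by_year_director.reset_index()

print(total_2016_a)
print(flat)
```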

In [32]:
# Multiple aggregations

aggregated_df = df.groupby("Year")['Revenue (Millions) filled'].agg(['mean', 'sum','count'])
aggregated_df

Year,mean,sum,count
2006,86.144835,3790.372752,44
2007,87.510481,4638.055505,53
2008,98.772623,5136.176376,52
2009,110.276186,5624.085505,51
2010,103.975319,6238.519128,60
2011,87.538355,5514.916376,63
2012,107.973281,6910.29,64
2013,86.984496,7915.589128,91
2014,84.992097,8329.225505,98
2015,78.862278,10015.509266,127


---

### Pandas GroupBy :
> **What groupby does (split‑apply‑combine):**
- Split: data is split into groups based on column(s).
- Apply: a function (aggregation, transform, filter, apply) runs on each group.
- Combine: results are stitched back together.

> **Common uses:**
- Aggregation (sum, mean, count, etc.) to get one row per group.
- Transform to return a value for every original row (e.g., group mean).
- Filter to keep/drop whole groups by a condition.
- Apply for custom per‑group operations.

> **Key methods:**
- .agg(...) / .aggregate(...): one or more aggregations; returns one row per group (a smaller DataFrame or Series).
- .transform(...): returns a result aligned with the input, with the same index and length.
- .filter(...): keeps only the groups for which the function returns True.
- .apply(...): runs an arbitrary function on each group.
- .groups, or iterating with `for name, group in grouped`: inspect which rows belong to each group.

> **Example usage :**

```python

# 1) mean revenue per Year (one value per Year)
grouped = df.groupby('Year')['Revenue (Millions) filled'].mean()

# 2) multiple aggregations
agg_df = df.groupby('Year').agg({'Revenue (Millions) filled': ['mean','sum','count']})

# 3) groupby + transform (returns same shape as df)
df['rev_minus_group_mean'] = df['Revenue (Millions) filled'] - df.groupby('Year')['Revenue (Millions) filled'].transform('mean')

# 4) filter groups (keep Years with >= 5 rows)
big_years = df.groupby('Year').filter(lambda g: len(g) >= 5)

```
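
The snippets above depend on the movie DataFrame. As a self-contained sketch, the same four patterns can be run on a small made-up frame (`toy` and its values are illustrative assumptions) where the results are easy to check by hand:

```python
import pandas as pd

# Hypothetical toy data: three rows for 2015, two rows for 2016.
toy = pd.DataFrame({
    "Year": [2015, 2015, 2015, 2016, 2016],
    "Revenue": [10.0, 20.0, 30.0, 40.0, 60.0],
})

# Aggregate: one row per Year.
per_year = toy.groupby("Year")["Revenue"].agg(["mean", "sum", "count"])

# Transform: one value per original row (the mean of that row's group).
group_mean = toy.groupby("Year")["Revenue"].transform("mean")

# Filter: keep only the rows of groups with at least 3 members.
big_years = toy.groupby("Year").filter(lambda g: len(g) >= 3)

# Iterate: each step yields the group key and the matching sub-frame.
sizes = {year: len(group) for year, group in toy.groupby("Year")}

print(per_year)
print(group_mean.tolist())
print(big_years)
print(sizes)
```

Aggregation shrinks the data to one row per group, while transform and filter keep the original row layout, which is why their results can be assigned back to or compared with the source frame.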

--- 


### Merging and Joining DataFrames :

In [36]:
df1 = pd.DataFrame({'Key': ['A', 'B', 'C'],'Value1': [1, 2, 3]})
df2 = pd.DataFrame({'Key': ['A', 'B', 'D'],'Value2': [4, 5, 6]})

print(df1)
print("\n")
print(df2)

  Key  Value1
0   A       1
1   B       2
2   C       3


  Key  Value2
0   A       4
1   B       5
2   D       6


In [None]:
# Merge Dataframe on Key Columns
pd.merge(df1,df2,on='Key',how='inner') # inner join

Unnamed: 0,Key,Value1,Value2
0,A,1,4
1,B,2,5


In [None]:
pd.merge(df1,df2,on='Key',how='left') #left join

Unnamed: 0,Key,Value1,Value2
0,A,1,4.0
1,B,2,5.0
2,C,3,


In [None]:
pd.merge(df1,df2,on='Key',how='right') # right join

Unnamed: 0,Key,Value1,Value2
0,A,1.0,4
1,B,2.0,5
2,D,,6


In [None]:
pd.merge(df1,df2,on='Key',how='outer') # outer join

Unnamed: 0,Key,Value1,Value2
0,A,1.0,4.0
1,B,2.0,5.0
2,C,3.0,
3,D,,6.0
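
When checking a join, it can help to see which side each row came from. A minimal sketch using the `indicator=True` option of `pd.merge`, which adds a `_merge` column (the `origin` dictionary is just an illustrative way to display it):

```python
import pandas as pd

df1 = pd.DataFrame({"Key": ["A", "B", "C"], "Value1": [1, 2, 3]})
df2 = pd.DataFrame({"Key": ["A", "B", "D"], "Value2": [4, 5, 6]})

# indicator=True adds a "_merge" column with the values
# "left_only", "right_only" or "both".
merged = pd.merge(df1, df2, on="Key", how="outer", indicator=True)

# Map each key to the side(s) it was found on.
origin = dict(zip(merged["Key"], merged["_merge"].astype(str)))

print(merged)
print(origin)
```

Keys marked `left_only` or `right_only` are exactly the rows that an inner join would drop.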
