## **WEEK 4: DATASET MANIPULATION USING PYTHON**

### **Day 1: Introduction to Pandas and Creating Datasets using Series**

In [2]:
import pandas as pd

# Creating Series for vegetables and their stock in kg
vegetables = pd.Series(['Tomato', 'Potato', 'Onion', 'Carrot'])
stocks = pd.Series([25, 40, 30, 20])

# Displaying the Series
print("Vegetables Series:")
print(vegetables)
print("\nStocks Series (in kg):")
print(stocks)

# Accessing the first vegetable and the last stock value
print("\nFirst vegetable:", vegetables.iloc[0])
print("Last stock value:", stocks.iloc[-1])

# Calculating total stock and average stock
print("\nTotal stock of all vegetables (kg):", stocks.sum())
print("Average stock of vegetables (kg):", stocks.mean())

# Creating a DataFrame
veg_df = pd.DataFrame({'Vegetable': vegetables, 'Stock (kg)': stocks})
print("\nVegetable DataFrame:")
print(veg_df)

# Adding a column for price per kg
veg_df['Price per kg'] = [50, 20, 40, 30]
print("\nDataFrame after adding Price per kg column:")
print(veg_df)

# Calculating the total value of each vegetable (Stock * Price)
veg_df['Total Value'] = veg_df['Stock (kg)'] * veg_df['Price per kg']
print("\nDataFrame after adding Total Value column:")
print(veg_df)

Vegetables Series:
0    Tomato
1    Potato
2     Onion
3    Carrot
dtype: object

Stocks Series (in kg):
0    25
1    40
2    30
3    20
dtype: int64

First vegetable: Tomato
Last stock value: 20

Total stock of all vegetables (kg): 115
Average stock of vegetables (kg): 28.75

Vegetable DataFrame:
  Vegetable  Stock (kg)
0    Tomato          25
1    Potato          40
2     Onion          30
3    Carrot          20

DataFrame after adding Price per kg column:
  Vegetable  Stock (kg)  Price per kg
0    Tomato          25            50
1    Potato          40            20
2     Onion          30            40
3    Carrot          20            30

DataFrame after adding Total Value column:
  Vegetable  Stock (kg)  Price per kg  Total Value
0    Tomato          25            50         1250
1    Potato          40            20          800
2     Onion          30            40         1200
3    Carrot          20            30          600
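
The two Series above use the default integer index, so values are looked up by position. As a minimal sketch, assuming the vegetable names are used as index labels instead (the name `stock_by_veg` is illustrative and not part of the original cell), the same data can be queried by label:

```python
import pandas as pd

# Hypothetical variant of the Day 1 data: vegetable names become the index labels
stock_by_veg = pd.Series(
    [25, 40, 30, 20],
    index=['Tomato', 'Potato', 'Onion', 'Carrot'],
    name='Stock (kg)',
)

# Label-based lookup with .loc, position-based lookup with .iloc
onion_stock = stock_by_veg.loc['Onion']
last_stock = stock_by_veg.iloc[-1]

# Boolean filtering keeps only the labels that satisfy the condition
well_stocked = stock_by_veg[stock_by_veg >= 30]

print("Onion stock (kg):", onion_stock)
print("Last stock value (kg):", last_stock)
print("Vegetables with at least 30 kg:", list(well_stocked.index))
```

With a labelled index, `.loc` reads like a dictionary lookup while `.iloc` still works by position.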


### **Day 2: Creating DataFrames Manually and Using Dictionaries**

In [3]:
import pandas as pd

# Creating a DataFrame using a dictionary with Indian names and student details
data = {
    'Name': ['Aarav', 'Ishaan', 'Meera', 'Diya', 'Ravi'],
    'Age': [21, 22, 20, 23, 24],
    'Marks': [78, 85, 90, 88, 95]
}

df = pd.DataFrame(data)

# Displaying the DataFrame
print("Initial DataFrame:")
print(df)

# Checking the structure of the DataFrame
print("\nDataFrame Shape:", df.shape)
print("Column Names:", df.columns)
print("DataFrame Info:")
print(df.info())  # info() prints its report itself and returns None, hence the trailing "None" in the output
print("\nSummary Statistics:")
print(df.describe())

# Adding a new column based on Marks using apply method
df['Pass/Fail'] = df['Marks'].apply(lambda x: 'Pass' if x >= 85 else 'Fail')
print("\nDataFrame after adding 'Pass/Fail' column using apply:")
print(df)

# Using map() to change names to uppercase
df['Name_Upper'] = df['Name'].map(lambda x: x.upper())
print("\nDataFrame after adding 'Name_Upper' column with uppercase names:")
print(df)

# Using loc[] with a Boolean mask to update a single value (Aarav's Age)
df.loc[df['Name'] == 'Aarav', 'Age'] = 22  # Changing Aarav's age
print("\nDataFrame after updating 'Age' of Aarav using loc:")
print(df)

# Removing a column using pop method
df.pop('Name_Upper')
print("\nDataFrame after removing 'Name_Upper' column using pop:")
print(df)

# Renaming the 'Marks' column to 'Total Marks' using rename
df.rename(columns={'Marks': 'Total Marks'}, inplace=True)
print("\nDataFrame after renaming 'Marks' to 'Total Marks':")
print(df)

Initial DataFrame:
     Name  Age  Marks
0   Aarav   21     78
1  Ishaan   22     85
2   Meera   20     90
3    Diya   23     88
4    Ravi   24     95

DataFrame Shape: (5, 3)
Column Names: Index(['Name', 'Age', 'Marks'], dtype='object')
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   Marks   5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes
None

Summary Statistics:
             Age      Marks
count   5.000000   5.000000
mean   22.000000  87.200000
std     1.581139   6.300794
min    20.000000  78.000000
25%    21.000000  85.000000
50%    22.000000  88.000000
75%    23.000000  90.000000
max    24.000000  95.000000

DataFrame after adding 'Pass/Fail' column using apply:
     Name  Age  Marks Pass/Fail
0   Aarav   21     78      Fail
1  Ishaan   2
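
The `apply()` and `map()` calls in this cell run a Python lambda once per row. As a hedged alternative sketch, the same two columns can be built with vectorized operations. `numpy.where` and `Series.str.upper` are used here as a swapped-in technique, and the name `marks_df` is illustrative:

```python
import numpy as np
import pandas as pd

# Reduced copy of the Day 2 data (Age omitted for brevity)
marks_df = pd.DataFrame({
    'Name': ['Aarav', 'Ishaan', 'Meera', 'Diya', 'Ravi'],
    'Marks': [78, 85, 90, 88, 95]
})

# Same rule as the apply() call: 85 or more is a pass
marks_df['Pass/Fail'] = np.where(marks_df['Marks'] >= 85, 'Pass', 'Fail')

# Series.str.upper() is the vectorized counterpart of map(lambda x: x.upper())
marks_df['Name_Upper'] = marks_df['Name'].str.upper()

print(marks_df)
```

On five rows the difference is negligible; the vectorized form mainly pays off on larger datasets.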

### **Day 3: Indexing, Slicing, and Conditional Selection**

In [4]:
import pandas as pd

# Creating a DataFrame with Indian cities and their populations
data = {
    'City': ['Mumbai', 'Delhi', 'Bengaluru', 'Kolkata', 'Chennai'],
    'Population': [20411000, 16787941, 8443675, 4486679, 4646732],
    'Area (sq km)': [603.4, 1484, 709, 185, 426]
}

df = pd.DataFrame(data)

# Selecting data using .loc[] (label-based)
print("Using .loc[] to select rows:")
print(df.loc[1:3])  # Selecting rows with labels 1 to 3 (inclusive)

# Selecting data using .iloc[] (integer position-based)
print("\nUsing .iloc[] to select rows:")
print(df.iloc[2:5])  # Selecting rows at positions 2 to 4 (the stop position 5 is excluded)

# Conditional selection using Boolean indexing
print("\nCities with Population > 7 million:")
print(df[df['Population'] > 7000000])

# Setting 'City' column as the index
df.set_index('City', inplace=True)
print("\nDataFrame after setting 'City' as the index:")
print(df)

# Resetting the index
df.reset_index(inplace=True)
print("\nDataFrame after resetting the index:")
print(df)

Using .loc[] to select rows:
        City  Population  Area (sq km)
1      Delhi    16787941        1484.0
2  Bengaluru     8443675         709.0
3    Kolkata     4486679         185.0

Using .iloc[] to select rows:
        City  Population  Area (sq km)
2  Bengaluru     8443675         709.0
3    Kolkata     4486679         185.0
4    Chennai     4646732         426.0

Cities with Population > 7 million:
        City  Population  Area (sq km)
0     Mumbai    20411000         603.4
1      Delhi    16787941        1484.0
2  Bengaluru     8443675         709.0

DataFrame after setting 'City' as the index:
           Population  Area (sq km)
City                               
Mumbai       20411000         603.4
Delhi        16787941        1484.0
Bengaluru     8443675         709.0
Kolkata       4486679         185.0
Chennai       4646732         426.0

DataFrame after resetting the index:
        City  Population  Area (sq km)
0     Mumbai    20411000         603.4
1      Delhi    16787
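
Once `City` is the index, `.loc` accepts city names directly. The sketch below assumes the same five-city data (the name `cities` is illustrative) and contrasts label slicing, which includes both endpoints, with positional slicing, which excludes the stop position:

```python
import pandas as pd

cities = pd.DataFrame({
    'City': ['Mumbai', 'Delhi', 'Bengaluru', 'Kolkata', 'Chennai'],
    'Population': [20411000, 16787941, 8443675, 4486679, 4646732],
    'Area (sq km)': [603.4, 1484, 709, 185, 426]
}).set_index('City')

# .loc slices by label and includes both endpoints
by_label = cities.loc['Delhi':'Kolkata']

# .iloc slices by position and excludes the stop position
by_position = cities.iloc[1:3]

# Single cell: row label plus column label
delhi_area = cities.loc['Delhi', 'Area (sq km)']

print(by_label)
print(by_position)
print("Delhi area (sq km):", delhi_area)
```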

### **Day 4: Advanced DataFrame Operations (Sorting, Filtering, Aggregation)**

In [None]:
import pandas as pd
# Creating a DataFrame with employee details
data = {
    'Name': ['Raj', 'Simran', 'Amit', 'Nina', 'Vikram'],
    'Age': [27, 32, 24, 29, 26],
    'Score': [88, 95, 82, 90, 92],
    'Department': ['HR', 'Finance', 'HR', 'Marketing', 'Finance']
}
df = pd.DataFrame(data)

# Sorting the DataFrame by Age
sorted_df = df.sort_values(by='Age', ascending=True)
print("DataFrame sorted by Age:")
print(sorted_df)

# Filtering data using .query()
high_score_employees = df.query('Score > 90')
print("\nEmployees with Score > 90 using .query():")
print(high_score_employees)

# Grouping data by Department and calculating average Score
avg_score_by_dept = df.groupby('Department')['Score'].mean()
print("\nAverage Score by Department:")
print(avg_score_by_dept)

# Aggregating multiple statistics
agg_stats = df.groupby('Department').agg({'Score': ['mean', 'max'], 'Age': 'min'})
print("\nAggregated statistics by Department:")
print(agg_stats)

DataFrame sorted by Age:
     Name  Age  Score Department
2    Amit   24     82         HR
4  Vikram   26     92    Finance
0     Raj   27     88         HR
3    Nina   29     90  Marketing
1  Simran   32     95    Finance

Employees with Score > 90 using .query():
     Name  Age  Score Department
1  Simran   32     95    Finance
4  Vikram   26     92    Finance

Average Score by Department:
Department
Finance      93.5
HR           85.0
Marketing    90.0
Name: Score, dtype: float64

Aggregated statistics by Department:
           Score     Age
            mean max min
Department              
Finance     93.5  95  26
HR          85.0  88  24
Marketing   90.0  90  29
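
The dictionary form of `agg()` above produces two-level column headers (`Score/mean`, `Score/max`, `Age/min`). A minimal sketch using pandas named aggregation gives flat, self-describing column names instead. The output names `mean_score`, `max_score`, and `min_age` are illustrative choices:

```python
import pandas as pd

employees = pd.DataFrame({
    'Name': ['Raj', 'Simran', 'Amit', 'Nina', 'Vikram'],
    'Age': [27, 32, 24, 29, 26],
    'Score': [88, 95, 82, 90, 92],
    'Department': ['HR', 'Finance', 'HR', 'Marketing', 'Finance']
})

# Named aggregation: output_column=(input_column, function)
flat_stats = employees.groupby('Department').agg(
    mean_score=('Score', 'mean'),
    max_score=('Score', 'max'),
    min_age=('Age', 'min'),
)

# Sort departments by their average score, highest first
flat_stats = flat_stats.sort_values('mean_score', ascending=False)
print(flat_stats)
```

Flat column names are easier to reference later, for example `flat_stats['mean_score']`.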


### **Day 5-7: Movie Rating Prediction (Reading Data, Handling Imbalances)**

In [6]:
# Load the IMDb India movies dataset (local Windows path; adjust it for your machine)
data = pd.read_csv(r"C:\Users\Dell\IMDb Movies India.csv", encoding="latin")
data.head()

|  | Name | Year | Duration | Genre | Rating | Votes | Director | Actor 1 | Actor 2 | Actor 3 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 |  |  |  | Drama |  |  | J.S. Randhawa | Manmauji | Birbal | Rajendra Bhatia |
| 1 | #Gadhvi (He thought he was Gandhi) | (2019) | 109 min | Drama | 7.0 | 8.0 | Gaurav Bakshi | Rasika Dugal | Vivek Ghamande | Arvind Jangid |
| 2 | #Homecoming | (2021) | 90 min | Drama, Musical |  |  | Soumyajit Majumdar | Sayani Gupta | Plabita Borthakur | Roy Angana |
| 3 | #Yaaram | (2019) | 110 min | Comedy, Romance | 4.4 | 35.0 | Ovais Khan | Prateik | Ishita Raj | Siddhant Kapoor |
| 4 | ...And Once Again | (2010) | 105 min | Drama |  |  | Amol Palekar | Rajat Kapoor | Rituparna Sengupta | Antara Mali |


In [7]:
data.shape

(15509, 10)

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15509 entries, 0 to 15508
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Name      15509 non-null  object 
 1   Year      14981 non-null  object 
 2   Duration  7240 non-null   object 
 3   Genre     13632 non-null  object 
 4   Rating    7919 non-null   float64
 5   Votes     7920 non-null   object 
 6   Director  14984 non-null  object 
 7   Actor 1   13892 non-null  object 
 8   Actor 2   13125 non-null  object 
 9   Actor 3   12365 non-null  object 
dtypes: float64(1), object(9)
memory usage: 1.2+ MB


In [9]:
data.isnull().sum()

Name           0
Year         528
Duration    8269
Genre       1877
Rating      7590
Votes       7589
Director     525
Actor 1     1617
Actor 2     2384
Actor 3     3144
dtype: int64
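
Raw counts are easier to judge as a share of the 15509 rows. The sketch below recomputes percentages from the counts printed above rather than from the CSV, so it runs without the file. On the real DataFrame, `data.isnull().mean() * 100` should give the same figures:

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above
missing_counts = pd.Series({
    'Name': 0, 'Year': 528, 'Duration': 8269, 'Genre': 1877, 'Rating': 7590,
    'Votes': 7589, 'Director': 525, 'Actor 1': 1617, 'Actor 2': 2384, 'Actor 3': 3144,
})
total_rows = 15509  # first element of data.shape

# Percentage of missing entries per column, largest first
missing_pct = (missing_counts / total_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Duration, Rating, and Votes are each missing in roughly half of the rows, which is why they need explicit handling before modelling.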

In [11]:
# Strip the " min" suffix and convert Duration to a numeric (float) column
data["Duration"] = data["Duration"].str.replace(" min", "").astype(float)
data["Duration"].head()

0      NaN
1    109.0
2     90.0
3    110.0
4    105.0
Name: Duration, dtype: float64
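
`astype(float)` raises an error if any entry is not a clean number after the suffix is removed. As a hedged alternative sketch, assuming some rows might hold unexpected text, the digits can be extracted and converted with `pd.to_numeric(..., errors="coerce")`. The sample values, including `"unknown"`, are hypothetical and not taken from the dataset:

```python
import pandas as pd

# Hypothetical sample in the same "<number> min" format as the Duration column
raw_duration = pd.Series([None, "109 min", "90 min", "unknown", "105 min"])

# Pull out the leading digits, then convert; anything unparseable becomes NaN
minutes = pd.to_numeric(
    raw_duration.str.extract(r"(\d+)", expand=False),
    errors="coerce",
)
print(minutes)
```

Unparseable entries end up as NaN, so they can be handled together with the genuinely missing values.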

In [12]:
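# Fill missing durations with the column median; assigning the result back avoids
# the chained inplace fillna, which recent pandas versions warn about
data["Duration"] = data["Duration"].fillna(data["Duration"].median())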
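
A minimal sketch of the median fill, using only the five Duration values shown in the `head()` output above (the names `sample_duration` and `filled` are illustrative):

```python
import pandas as pd

# First five Duration values after the " min" suffix was removed
sample_duration = pd.Series([float("nan"), 109.0, 90.0, 110.0, 105.0], name="Duration")

# median() skips NaN by default, so it is computed from the four known values
median_duration = sample_duration.median()

# Assign the filled result to a new Series instead of modifying in place
filled = sample_duration.fillna(median_duration)

print("Median duration:", median_duration)
print("Missing values left:", filled.isna().sum())
```

The median is less sensitive than the mean to a few very long or very short films, which is a common reason to prefer it for imputation.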