# 🧹 Data Cleaning Summary

This notebook summarizes the key data cleaning steps performed during the IPL Data Analysis project. These checks confirm that the data is complete, consistently typed, and ready for visualization or reporting.


In [5]:
# Importing necessary libraries
import pandas as pd

# Load the cleaned dataset (adjust path/filename as needed)
df = pd.read_csv('../Data/Virat_Kohli_YearWise_with_IPL.csv')

# Overview of the dataset
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 28 entries, 0 to 27
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Format        28 non-null     object 
 1   Year          28 non-null     int64  
 2   Matches       28 non-null     int64  
 3   Innings       28 non-null     int64  
 4   Runs          28 non-null     int64  
 5   Average       28 non-null     float64
 6   HighestScore  28 non-null     object 
 7   Fifties       28 non-null     int64  
 8   Centuries     28 non-null     int64  
dtypes: float64(1), int64(6), object(2)
memory usage: 2.1+ KB
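In the output above, `HighestScore` is the only statistic stored as `object` rather than a numeric dtype. A common reason in cricket data is a trailing `*` marking a not-out innings, but that is an assumption about this file, not something the notebook shows. Under that assumption, a minimal sketch of splitting the column into a numeric score and a flag could look like this (the sample values and the new column names `HighestScoreRuns` and `HighestScoreNotOut` are hypothetical):

```python
import pandas as pd

# Hypothetical values; the real column contents are not shown in this notebook.
scores = pd.Series(["183", "122*", "99"], name="HighestScore")

# Assumes a trailing "*" marks a not-out innings.
not_out = scores.str.endswith("*")

# Strip the marker, then convert; anything still unparseable becomes NaN.
numeric = pd.to_numeric(scores.str.rstrip("*"), errors="coerce")

parsed = pd.DataFrame({
    "HighestScoreRuns": numeric,
    "HighestScoreNotOut": not_out,
})
print(parsed)
```

Keeping the flag in its own column preserves the not-out information while letting the score be sorted and plotted as a number.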


## 🔍 Null Values & Duplicates Check

We check for missing values and duplicated rows to ensure data quality.


In [6]:
# Checking for null values
print("Null values in each column:")
print(df.isnull().sum())

# Checking for duplicates
print("\nDuplicate rows:", df.duplicated().sum())


Null values in each column:
Format          0
Year            0
Matches         0
Innings         0
Runs            0
Average         0
HighestScore    0
Fifties         0
Centuries       0
dtype: int64

Duplicate rows: 0
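Note that `df.duplicated()` only flags rows that match in every column. If each row is meant to be unique per format and year (an assumption based on the year-wise file name), a key-based check can catch entries that repeat a key but differ in their statistics. A small sketch with made-up rows:

```python
import pandas as pd

# Toy rows for illustration only; these are not values from the real CSV.
sample = pd.DataFrame({
    "Format": ["ODI", "ODI", "IPL"],
    "Year": [2016, 2016, 2016],
    "Runs": [700, 710, 900],
})

# Full-row check: the two ODI 2016 rows differ in Runs, so nothing is flagged.
full_row_dupes = sample.duplicated().sum()

# Key-based check: keep=False marks every row that shares a (Format, Year) pair.
key_dupes = sample[sample.duplicated(subset=["Format", "Year"], keep=False)]

print("Full-row duplicates:", full_row_dupes)
print("Rows sharing a (Format, Year) key:", len(key_dupes))
```

On the real data, the same `subset=["Format", "Year"]` call could be run against `df` to confirm that each format appears once per year.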


## 🧾 Data Type Conversion

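We confirm that numeric columns such as `Year`, `Matches`, and `Runs` are stored as integers. In the `df.info()` output above they already load as `int64`, so the conversions here act as a safeguard in case a later version of the file contains stray text. Numeric dtypes matter because text columns sort lexically and cannot be summed or averaged.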


In [7]:
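# Enforce numeric dtypes (a no-op here, since both columns already load as int64)
df['Year'] = df['Year'].astype(int)
# errors='coerce' turns unparseable values into NaN instead of raising an error
df['Runs'] = pd.to_numeric(df['Runs'], errors='coerce')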
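Because `errors='coerce'` replaces unparseable values with `NaN` without any warning, it can be worth counting how many values were lost in the conversion. A minimal sketch on a toy column (the values, including the `"n/a"` placeholder, are invented for illustration):

```python
import pandas as pd

# Toy column for illustration; "n/a" stands in for a value that cannot be parsed.
raw = pd.Series(["120", "n/a", "305"], name="Runs")

converted = pd.to_numeric(raw, errors="coerce")

# Values present before conversion but NaN afterwards were coerced away.
coerced_mask = raw.notna() & converted.isna()

print("Values lost to coercion:", int(coerced_mask.sum()))
print(raw[coerced_mask])
```

A count of zero means the conversion changed only the dtype; anything higher points at rows to inspect before plotting.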



## ✏️ Column Renaming (if applicable)

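Inconsistent column headers from the raw data (`50s`, `100s`) are renamed to descriptive ones. The file loaded above already uses `Fifties` and `Centuries`, so this cell changes nothing here: by default, `rename` silently ignores keys that are not present in the DataFrame.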


In [8]:
df.rename(columns={
    '50s': 'Fifties',
    '100s': 'Centuries'
}, inplace=True)
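Since a rename with missing keys passes silently, one way to make the step self-reporting is to check which old headers exist before renaming. The following is only a sketch on a toy frame that mirrors the already-renamed headers; the variable names `mapping`, `present`, and `missing` are illustrative:

```python
import pandas as pd

# Toy frame whose headers already use the final names, as in the df.info() output.
sample = pd.DataFrame({"Fifties": [3], "Centuries": [1]})

mapping = {"50s": "Fifties", "100s": "Centuries"}

# Only rename headers that are actually present; report the rest.
present = {old: new for old, new in mapping.items() if old in sample.columns}
missing = sorted(set(mapping) - set(present))

sample = sample.rename(columns=present)

print("Renamed:", present)
print("Not found (already renamed or absent):", missing)
```

Alternatively, passing `errors="raise"` to `rename` makes pandas raise a `KeyError` when a key is missing, which suits a pipeline that always starts from the raw file.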


## ✅ Final Statement

The dataset has been successfully cleaned and validated. It's now ready for use in statistical analysis, visualizations, and dashboarding.
