Day 1 – Project Setup & Data Understanding

Objective: Set up your environment and understand the dataset you’ll work with.

Tasks:

Project Setup

Install the necessary libraries: pandas, numpy, matplotlib, seaborn (for example, pip install pandas numpy matplotlib seaborn).

Create project folder:

Pandas_Project/
├── data/
├── notebooks/
├── scripts/
└── output/


Start a Jupyter Notebook for your work.
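
The folder layout above can also be created from Python instead of by hand. This is a minimal sketch, assuming the project should live in the current working directory; the variable name project_root is only illustrative.

```python
from pathlib import Path

# Root folder name taken from the layout above; the location
# (current working directory) is an assumption.
project_root = Path("Pandas_Project")

for sub in ("data", "notebooks", "scripts", "output"):
    # parents=True also creates Pandas_Project/ itself;
    # exist_ok=True makes the cell safe to re-run.
    (project_root / sub).mkdir(parents=True, exist_ok=True)

# List the subfolders that now exist.
print(sorted(p.name for p in project_root.iterdir() if p.is_dir()))
```

Inside a notebook, the libraries can usually be installed from a cell with %pip install pandas numpy matplotlib seaborn. Reading .xlsx files with pd.read_excel() typically needs the openpyxl package as well.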

Dataset Selection

Choose a real-world dataset (CSV or Excel). Suggestions:

Sales data

Employee data

E-commerce product data

Place it in the data/ folder.

Data Loading & Inspection

Load the dataset with pd.read_csv() or pd.read_excel().

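Explore the first 5–10 rows using .head(); pass a row count such as .head(10), since the default is 5.

Check column types and non-null counts with .info(), and summary statistics for numeric columns with .describe().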

Initial Questions

Identify the columns, data types, and missing values.

Note anything interesting about the dataset, especially questions that later analysis could answer.
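
One way to answer these questions at a glance is to collect the data type, missing-value count, and number of distinct values per column into a single table. The sketch below uses a small stand-in DataFrame with made-up values; in the notebook, the df loaded from your own file would take its place. The name overview is only illustrative.

```python
import pandas as pd

# Stand-in data with made-up values; replace with the df you loaded.
df = pd.DataFrame({
    "Education": ["Bachelors", "Masters", None, "Bachelors"],
    "Age": [34, 28, 38, 27],
    "City": ["Pune", "Pune", "Bangalore", None],
})

# One row per column: its dtype, how many values are missing,
# and how many distinct non-missing values it holds.
overview = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing": df.isnull().sum(),
    "unique": df.nunique(),
})
print(overview)
```

Columns with few distinct values are often candidates for grouping later, and columns with many missing values are the ones to look at first on Day 2.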

Deliverable by the end of Day 1:

A notebook with dataset loaded, basic inspection done, and observations written down.

In [6]:
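import pandas as pd

df = pd.read_csv("Employee.csv")
df.head(10)      # not displayed: a notebook renders only a cell's last expression
df.tail(10)      # not displayed, for the same reason
df.info()        # prints its summary directly
df.describe()    # last expression, so the notebook renders it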


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4653 entries, 0 to 4652
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   Education                  4653 non-null   object
 1   JoiningYear                4653 non-null   int64 
 2   City                       4653 non-null   object
 3   PaymentTier                4653 non-null   int64 
 4   Age                        4653 non-null   int64 
 5   Gender                     4653 non-null   object
 6   EverBenched                4653 non-null   object
 7   ExperienceInCurrentDomain  4653 non-null   int64 
 8   LeaveOrNot                 4653 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 327.3+ KB


       JoiningYear  PaymentTier        Age  ExperienceInCurrentDomain  LeaveOrNot
count       4653.0       4653.0     4653.0                     4653.0      4653.0
mean    2015.06297     2.698259  29.393295                   2.905652    0.343864
std       1.863377     0.561435   4.826087                    1.55824    0.475047
min         2012.0          1.0       22.0                        0.0         0.0
25%         2013.0          3.0       26.0                        2.0         0.0
50%         2015.0          3.0       28.0                        3.0         0.0
75%         2017.0          3.0       32.0                        4.0         1.0
max         2018.0          3.0       41.0                        7.0         1.0


Day 2 – Data Cleaning & Preparation

Objective: Clean the dataset and prepare it for analysis.

Tasks:

Handle Missing Values

Identify missing values using df.isnull().sum().

Decide how to handle them:

Drop rows or columns with too many missing values (df.dropna()).

Fill missing values (df.fillna()), e.g., mean/median for numeric, mode for categorical.
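
The fill strategy described above might look like the following sketch. It uses a small made-up DataFrame, fills numeric columns with their median, and fills every other column with its most frequent value. Treating all non-numeric columns as categorical is an assumption that may not suit every dataset.

```python
import pandas as pd

# Made-up example data with one gap in each column.
df = pd.DataFrame({
    "Age": [25.0, None, 31.0, 40.0],
    "City": ["Pune", "Pune", None, "Bangalore"],
})

print(df.isnull().sum())  # missing values per column, before filling

# Numeric columns: fill each with its own median.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Remaining columns: fill with the most frequent value (the mode).
# mode() can return several values on ties; iloc[0] takes the first.
for col in df.columns.difference(numeric_cols):
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

If a column is entirely empty, mode() returns nothing and iloc[0] fails, so such columns are better dropped than filled.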

Handle Duplicates

Check duplicates with df.duplicated().sum().

Remove duplicates using df.drop_duplicates().
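
As a small illustration of the two calls above, the sketch below counts and removes duplicate rows in a made-up DataFrame. The names n_dupes and deduped are only illustrative.

```python
import pandas as pd

# Made-up data: rows 1 and 3 repeat row 0 exactly.
df = pd.DataFrame({
    "City": ["Pune", "Pune", "Bangalore", "Pune"],
    "Age": [25, 25, 31, 25],
})

# duplicated() marks every row that repeats an earlier row,
# so the first occurrence is not counted.
n_dupes = df.duplicated().sum()
print(n_dupes)

# drop_duplicates() returns a new DataFrame; reset the index
# so it runs 0..n-1 again.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

When a dataset has no unique identifier column, identical rows may describe different real-world records, so it is worth checking what a row represents before removing duplicates.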

Correct Data Types

Check data types using df.dtypes.

Convert columns if needed:

df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
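
The two conversions can be tried on a small made-up DataFrame first. This sketch assumes columns named date and category, as in the lines above, and adds errors="coerce" so that unparseable dates become NaT instead of raising an error. That option is a choice made here, not something the plan requires.

```python
import pandas as pd

# Made-up data, including one value that is not a valid date.
df = pd.DataFrame({
    "date": ["2023-01-15", "2023-02-20", "not a date"],
    "category": ["A", "B", "A"],
})

# errors="coerce" turns values that cannot be parsed into NaT.
df["date"] = pd.to_datetime(df["date"], errors="coerce")

# The category dtype stores each distinct label once, which suits
# columns with few repeated values.
df["category"] = df["category"].astype("category")

print(df.dtypes)
print(df["date"].isna().sum())  # number of dates that failed to parse
```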


Rename Columns (Optional)

Make column names simple and consistent:

df.rename(columns={'OldName':'new_name'}, inplace=True)
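
When many columns need the same treatment, rename() also accepts a function. The sketch below converts CamelCase names, like those in the Employee dataset shown earlier, to snake_case. The helper to_snake_case is hypothetical and written for this example only.

```python
import re
import pandas as pd

# Empty frame with a few of the Employee column names.
df = pd.DataFrame(columns=["JoiningYear", "PaymentTier", "ExperienceInCurrentDomain"])

def to_snake_case(name):
    """Insert '_' before each capital letter except the first, then lowercase."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# Passing a function applies it to every column name.
df = df.rename(columns=to_snake_case)
print(list(df.columns))
```

This simple rule splits acronyms letter by letter (for example, ID becomes i_d), so such names are better handled with an explicit dictionary.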


Initial Feature Engineering

Add any new useful columns if needed, e.g., extracting year/month from a date:

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
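
The .dt accessor only works on datetime columns, so the date column has to be converted first. A minimal sketch on made-up dates follows; the extra month_name column is an addition for illustration.

```python
import pandas as pd

# Made-up dates, already converted to datetime.
df = pd.DataFrame({"date": pd.to_datetime(["2023-01-15", "2024-11-03"])})

df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
# Readable month labels can be handy for plot axes later.
df["month_name"] = df["date"].dt.month_name()

print(df)
```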


Deliverable by the end of Day 2:

Cleaned dataset ready for analysis.

Notebook with missing value handling, duplicates removed, data types corrected, and any new columns added.

In [9]:
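print(df.isnull().sum())   # call sum(); without parentheses only the method object is shown
df = df.dropna()           # dropna() returns a new DataFrame, so assign the result
# fillna() needs an explicit value; with no arguments it raises
# ValueError: Must specify a fill 'value' or 'method'.
df = df.fillna(0)          # 0 is only a placeholder; df.info() above shows no missing values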