<a href="https://www.kaggle.com/code/ashmitcajla/4-basic-pandas-for-ml?scriptVersionId=234199311" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# 📘 Fun with Pandas - Your First Step into Data Science!

Welcome to **Pandas** – the Swiss Army knife of data manipulation in Python! 🐼  
This interactive guide will teach you how to load, analyze, filter, clean, and summarize data effortlessly.  
Let’s go step by step and get you Pandas-proficient in no time!


### 🔹 1. Getting Started with Pandas

In [1]:
import pandas as pd

# Check the version you're using
print("Pandas version:", pd.__version__)

Pandas version: 2.2.3


> 🧠 **Fun Fact:** The name *pandas* comes from *panel data*, an econometrics term for multidimensional datasets, not from the animal!

### 🔹 2. Create Your First DataFrame

In [2]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Paris', 'Berlin', 'Tokyo']
}

df = pd.DataFrame(data)
df

| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 25 | New York |
| 1 | Bob | 30 | Paris |
| 2 | Charlie | 35 | Berlin |
| 3 | David | 40 | Tokyo |


### 🔹 3. Reading Data from a CSV File

In [3]:
# Uncomment if you have a CSV
# df = pd.read_csv("your_dataset.csv")
# df.head()  # Show top 5 rows

> 💡 Use `.head()` to preview and `.tail()` to peek at the end.
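
Here is a minimal sketch of loading and previewing a CSV. To stay self-contained it reads the CSV text from an in-memory string via `io.StringIO` rather than a file on disk; the rows and the variable names (`csv_text`, `people`) are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file such as "your_dataset.csv"
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Paris
Charlie,35,Berlin
David,40,Tokyo
Eve,45,Rome
Frank,50,Oslo
"""

# read_csv accepts a file path or any file-like object
people = pd.read_csv(io.StringIO(csv_text))

first_rows = people.head()    # first 5 rows by default
last_rows = people.tail(2)    # last 2 rows

print(first_rows)
print(last_rows)
```

With a real file, you would pass its path to `pd.read_csv` instead of the `StringIO` object.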

### 🔹 4. Descriptive Statistics

In [4]:
df.describe()

| | Age |
|---|---|
| count | 4.0 |
| mean | 32.5 |
| std | 6.454972 |
| min | 25.0 |
| 25% | 28.75 |
| 50% | 32.5 |
| 75% | 36.25 |
| max | 40.0 |
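
The summary above covers only `Age` because `describe()` looks at numeric columns by default. As a small sketch (the frame is rebuilt here under the assumed name `people` so the snippet runs on its own), passing `include='all'` also summarizes the text columns:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Paris', 'Berlin', 'Tokyo'],
})

numeric_summary = people.describe()             # numeric columns only
full_summary = people.describe(include='all')   # adds unique/top/freq rows for text columns

print(list(numeric_summary.columns))
print(full_summary.loc['unique', 'City'])
```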


### 🔹 5. Selecting Columns and Rows

In [5]:
# Column selection
df['Name']

# Multiple columns
df[['Name', 'City']]

# Row selection by index
df.iloc[2]

# Row selection by condition
df[df['Age'] > 30]

| | Name | Age | City |
|---|---|---|---|
| 2 | Charlie | 35 | Berlin |
| 3 | David | 40 | Tokyo |


> 🧠 `.iloc` selects by integer position, `.loc` selects by index label. Try both!
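
The difference is easiest to see when the index is not the default 0, 1, 2. The sketch below uses an illustrative frame with string labels (the labels and the name `people` are assumptions, not part of the notebook's `df`):

```python
import pandas as pd

# Illustrative frame with string labels as the index
people = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Paris', 'Berlin']},
    index=['alice', 'bob', 'charlie'],
)

by_position = people.iloc[0]      # first row, whatever its label is
by_label = people.loc['alice']    # row whose index label is 'alice'

# .iloc slices exclude the end position; .loc slices include the end label
pos_slice = people.iloc[0:2]              # rows at positions 0 and 1
label_slice = people.loc['alice':'bob']   # rows 'alice' through 'bob'

print(by_position.equals(by_label))
print(pos_slice.equals(label_slice))
```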

### 🔹 6. Adding & Modifying Columns

In [6]:
# Add a new column
df['Age in 5 Years'] = df['Age'] + 5
df

# Modify an existing one
df['Age'] = df['Age'] - 1

### 🔹 7. Deleting Columns or Rows

In [7]:
# Remove a column
df.drop('Age in 5 Years', axis=1, inplace=True)

# Remove a row
df.drop(2, axis=0, inplace=False)

| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 24 | New York |
| 1 | Bob | 29 | Paris |
| 3 | David | 39 | Tokyo |


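> 🚨 `drop` returns a new DataFrame by default and leaves the original unchanged, which is why row 2 is still in `df` after the second call above. To keep a change, assign the result back (`df = df.drop(...)`) or pass `inplace=True`.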
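
A short sketch of that behavior, using a throwaway frame named `people` (an assumption for illustration) so the notebook's `df` is not affected:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [24, 29, 34]})

# drop returns a new DataFrame; the original is untouched
without_bob = people.drop(1, axis=0)
print(len(people), len(without_bob))

# To keep the change, assign the result back to the same name
people = people.drop(columns='Age')
print(list(people.columns))
```

Reassigning makes it explicit which object holds the modified data.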

### 🔹 8. Sorting Data

In [8]:
df.sort_values(by='Age', ascending=True)

| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 24 | New York |
| 1 | Bob | 29 | Paris |
| 2 | Charlie | 34 | Berlin |
| 3 | David | 39 | Tokyo |
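
Because the data was already in ascending age order, the sorted result looks the same as the input. The sketch below uses a slightly different illustrative frame (with a tie in `Age`, which is an assumption) to show descending order and sorting by more than one column:

```python
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 29, 34, 29],
    'City': ['New York', 'Paris', 'Berlin', 'Tokyo'],
})

oldest_first = people.sort_values(by='Age', ascending=False)

# Sort by Age descending, breaking ties by Name ascending
by_age_then_name = people.sort_values(by=['Age', 'Name'], ascending=[False, True])

# sort_values keeps the original index labels; reset_index renumbers from 0
renumbered = by_age_then_name.reset_index(drop=True)
print(renumbered)
```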


### 🔹 9. Grouping & Aggregation

In [9]:
# Group by City and calculate average age
df.groupby('City')['Age'].mean()

City
Berlin      34.0
New York    24.0
Paris       29.0
Tokyo       39.0
Name: Age, dtype: float64
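
Every city appears only once in this dataset, so each group mean is just that person's age. As a sketch with repeated cities (the data and the output column names `mean_age`, `total_age`, and `n_people` are made up), several metrics can be computed per group in one call with `.agg`:

```python
import pandas as pd

# Illustrative data with repeated cities so each group can hold several rows
people = pd.DataFrame({
    'City': ['Paris', 'Paris', 'Berlin', 'Berlin', 'Tokyo'],
    'Age': [30, 40, 20, 30, 50],
})

# Named aggregation: each keyword becomes an output column
summary = people.groupby('City').agg(
    mean_age=('Age', 'mean'),
    total_age=('Age', 'sum'),
    n_people=('Age', 'count'),
)
print(summary)
```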

### 🔹 10. Handling Missing Data (NaN)

In [10]:
# Simulate missing data
df.loc[1, 'Age'] = None
print("With NaN:\n", df)

# Fill with the mean. Assign the result back instead of calling
# fillna(..., inplace=True) on df['Age'], which acts on a copy.
df['Age'] = df['Age'].fillna(df['Age'].mean())

With NaN:
       Name   Age      City
0    Alice  24.0  New York
1      Bob   NaN     Paris
2  Charlie  34.0    Berlin
3    David  39.0     Tokyo

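
Filling with the mean is one option among several. The sketch below, on a standalone frame named `people` (an assumption, so the notebook's `df` is left alone), counts the missing values first and then compares dropping with filling:

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24.0, np.nan, 34.0, 39.0],
})

# Count missing values per column
print(people.isna().sum())

# Option 1: drop rows that have a missing Age
dropped = people.dropna(subset=['Age'])

# Option 2: fill with the column mean (mean() skips NaN by default)
filled = people.copy()
filled['Age'] = filled['Age'].fillna(filled['Age'].mean())
print(filled)
```

Which option fits depends on the data: dropping loses rows, while filling keeps them at the cost of an estimated value.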


### 🔹 11. Exporting Data

In [11]:
df.to_csv("cleaned_data.csv", index=False)
print("Data saved as cleaned_data.csv")

Data saved as cleaned_data.csv


### 🔹 12. Bonus Tips & Tricks

In [12]:
# Value Counts
df['City'].value_counts()

# Apply a function
df['Age Group'] = df['Age'].apply(lambda x: 'Senior' if x > 30 else 'Young')
df

# Rename Columns
df.rename(columns={'Age': 'Age (Years)'}, inplace=True)

# Check for duplicates
df.duplicated().sum()

0
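
The duplicate count is 0 here because every row is unique. As a sketch on made-up data that does contain a duplicate, `drop_duplicates` removes the repeats, and `numpy.where` is one vectorized alternative to the `apply` call above (swapping it in is a suggestion, not something the notebook does):

```python
import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'David'],
    'Age': [24, 34, 34, 39],
})

# Drop exact duplicate rows, keeping the first occurrence
unique_people = people.drop_duplicates().reset_index(drop=True)

# Vectorized equivalent of apply(lambda x: 'Senior' if x > 30 else 'Young')
unique_people['Age Group'] = np.where(unique_people['Age'] > 30, 'Senior', 'Young')
print(unique_people)
```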

### 🔹 🎯 Final Challenge (Practice Time!)

Try this:
- Load a dataset of your choice from [Kaggle](https://kaggle.com)
- Clean the data: remove NaNs, drop duplicates
- Calculate groupwise metrics (mean/sum/count)
- Create new columns using custom logic
- Export cleaned data
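
If you want a starting point, the steps above could be sketched roughly as follows. The data is a tiny made-up CSV held in a string rather than a real Kaggle dataset, and the column names (`name`, `team`, `score`) and the threshold of 20 are assumptions chosen only for illustration.

```python
import io
import pandas as pd

# Hypothetical raw data with one missing value and one duplicate row
raw_csv = """name,team,score
Ana,red,10
Ben,blue,
Cy,red,30
Cy,red,30
Dee,blue,20
"""

raw = pd.read_csv(io.StringIO(raw_csv))

# Clean: drop rows with NaN, then drop duplicate rows
clean = raw.dropna().drop_duplicates().reset_index(drop=True)

# Groupwise metrics per team
team_stats = clean.groupby('team')['score'].agg(['mean', 'sum', 'count'])

# New column from custom logic (the threshold 20 is arbitrary)
clean['high_score'] = clean['score'] >= 20

# Export; with no path, to_csv returns the CSV text instead of writing a file
csv_out = clean.to_csv(index=False)
print(team_stats)
print(csv_out)
```

For a real dataset, replace the `StringIO` source with the path to your downloaded CSV and pass a filename to `to_csv`.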

> 🤯 **Fun Fact Before You Go!**
>
> Pandas is used by **Netflix**, **Spotify**, and **NASA** to analyze data. You're in good company! 🚀