# PANDAS

![Pandas Image](images/pandas1.png)

![Pandas Image](images/class2.png)

![image](images/image1.png)

![image](images/image2.png)

In [1]:
import pandas as pd

s1 = pd.Series([1, 2, 3, 4, 5])
s1

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [3]:
type(s1)

pandas.core.series.Series

![image](images/image3.png)

In [4]:
# Changing index
s1 = pd.Series([1, 2, 3, 4, 5], index = ["a", "b", "c", "d", "e"])
s1

a    1
b    2
c    3
d    4
e    5
dtype: int64

![image](images/image4.png)

In [5]:
pd.Series({
    "k1": 10,
    "k2": 20,
    "k3": 30
})

k1    10
k2    20
k3    30
dtype: int64

![image](images/image5.png)

In [6]:
s2  = pd.Series({
    "k1": 10,
    "k2": 20,
    "k3": 30
}, index = ["k3", "k1", "k4", "k2"])

s2

k3    30.0
k1    10.0
k4     NaN
k2    20.0
dtype: float64
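
The `NaN` for `k4` above comes from label lookup: when both a dict and an `index` are passed, pandas follows the order of `index` and fills any label missing from the dict with `NaN`, which also pushes the dtype to `float64`. A minimal sketch of that behaviour (the names `data`, `s_reindexed`, and `s_full` are illustrative and not part of the notebook):

```python
import pandas as pd

data = {"k1": 10, "k2": 20, "k3": 30}

# "k4" is requested in the index but has no entry in the dict, so it becomes NaN
s_reindexed = pd.Series(data, index=["k3", "k1", "k4", "k2"])
print(s_reindexed.isna())   # True only for "k4"
print(s_reindexed.dtype)    # float, because NaN cannot be stored in an int64 Series

# With no missing label, the values stay integers
s_full = pd.Series(data, index=["k3", "k1", "k2"])
print(s_full.dtype)
```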

![image](images/6.png)

In [9]:
l1 = [1, 2, 3, 4, 5, 6, 7, 8, 9]

s3 = pd.Series(l1)
s3

0    1
1    2
2    3
3    4
4    5
5    6
6    7
7    8
8    9
dtype: int64

In [10]:
s3[:4]

0    1
1    2
2    3
3    4
dtype: int64

In [11]:
s3[4]

5

In [12]:
s3[-2: ]

7    8
8    9
dtype: int64
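
With the default integer index, `s3[4]` looks up the label `4`, which happens to coincide with position 4. Once a Series has custom labels, it is usually clearer to say explicitly whether you mean a position or a label. A small sketch using `.iloc` and `.loc` (the Series `s` here is a hypothetical example, not the notebook's `s3`):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=["a", "b", "c", "d", "e"])

print(s.iloc[4])        # by position: the fifth element
print(s.loc["e"])       # by label: the element labelled "e"
print(s.iloc[-2:])      # last two elements by position
print(s.loc["b":"d"])   # label slices include both endpoints
```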

![image](images/7.png)

In [13]:
s3 + 2

0     3
1     4
2     5
3     6
4     7
5     8
6     9
7    10
8    11
dtype: int64

In [14]:
s3 * 5

0     5
1    10
2    15
3    20
4    25
5    30
6    35
7    40
8    45
dtype: int64

In [15]:
s4 = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90])

s3 + s4

0    11
1    22
2    33
3    44
4    55
5    66
6    77
7    88
8    99
dtype: int64
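
`s3 + s4` lines up neatly because both Series share the same index. Addition is matched by label, not by position, so labels that exist in only one Series produce `NaN`. A short sketch of this, assuming two made-up Series `a` and `b` with partly overlapping labels:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])

# Labels are aligned first; "x" and "w" exist on one side only, so they become NaN
total = a + b
print(total)

# Series.add with fill_value treats the missing side as 0 instead
filled = a.add(b, fill_value=0)
print(filled)
```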

![image](images/8.png)

![image](images/9.png)

In [17]:
pd.DataFrame({
    "Name": [
        "Bob", "Sam", "Sana", "Yahya",
        "Taha", "Hamza", "Jon", "Michael"
    ],
    "Marks": [
        73, 66, 90, 80, 99, 92, 86, 85
    ]
})

Unnamed: 0,Name,Marks
0,Bob,73
1,Sam,66
2,Sana,90
3,Yahya,80
4,Taha,99
5,Hamza,92
6,Jon,86
7,Michael,85


![image](images/10.png)

In [2]:
iris = pd.read_csv("datasets/iris.csv")
iris.head() # Gives the first 5 rows

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [21]:
iris.tail() # Gives the last 5 rows

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
145,6.7,3.0,5.2,2.3,Virginica
146,6.3,2.5,5.0,1.9,Virginica
147,6.5,3.0,5.2,2.0,Virginica
148,6.2,3.4,5.4,2.3,Virginica
149,5.9,3.0,5.1,1.8,Virginica


In [22]:
iris.shape # Returns (number of rows, number of columns)

(150, 5)

In [24]:
iris.describe() # Gives the summary of the data

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
count,150.0,150.0,150.0,150.0
mean,5.843333,3.057333,3.758,1.199333
std,0.828066,0.435866,1.765298,0.762238
min,4.3,2.0,1.0,0.1
25%,5.1,2.8,1.6,0.3
50%,5.8,3.0,4.35,1.3
75%,6.4,3.3,5.1,1.8
max,7.9,4.4,6.9,2.5


You can also use it for non-numeric columns:
```python
iris.describe(include='object') # summarizes non-numeric (object) columns such as "variety"
```

![image](images/11.png)

In [26]:
iris.iloc[5:11, 2:]  # positions 5 to 10; the row at position 11 is not included

Unnamed: 0,petal.length,petal.width,variety
5,1.7,0.4,Setosa
6,1.4,0.3,Setosa
7,1.5,0.2,Setosa
8,1.4,0.2,Setosa
9,1.5,0.1,Setosa
10,1.5,0.2,Setosa


![image](images/12.png)

In [29]:
iris.loc[0:3, ["sepal.length", "petal.length"]] # With .loc, 0:3 includes label 3, so you get 4 rows in total

Unnamed: 0,sepal.length,petal.length
0,5.1,1.4
1,4.9,1.4
2,4.7,1.3
3,4.6,1.5
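
The difference between the two slicing rules is easy to miss, so here is a side-by-side sketch on a small made-up DataFrame (`df_demo` is hypothetical; the iris data is not needed to see the effect):

```python
import pandas as pd

df_demo = pd.DataFrame({"a": range(10, 16), "b": range(20, 26)})

by_position = df_demo.iloc[0:3]   # positions 0, 1, 2 (end excluded)
by_label = df_demo.loc[0:3]       # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))

# Columns are selected by position in .iloc and by name in .loc
print(df_demo.iloc[0:3, [0]])
print(df_demo.loc[0:3, ["a"]])
```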


![image](images/13.png)

In [30]:
iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [31]:
# To drop the "variety" column:
iris.drop("variety", axis = 1) # axis = 1 drops columns; axis = 0 drops rows (see the next example)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


![image](images/14.png)

In [32]:
iris.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
2,4.7,3.2,1.3,0.2,Setosa
3,4.6,3.1,1.5,0.2,Setosa
4,5.0,3.6,1.4,0.2,Setosa


In [35]:
new1 = iris.drop([2, 3, 4], axis = 0)
new1.head()

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
0,5.1,3.5,1.4,0.2,Setosa
1,4.9,3.0,1.4,0.2,Setosa
5,5.4,3.9,1.7,0.4,Setosa
6,4.6,3.4,1.4,0.3,Setosa
7,5.0,3.4,1.5,0.2,Setosa
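
Note that `drop` returns a new DataFrame and leaves the original untouched, which is why `iris.head()` still shows the `variety` column after the earlier call. A minimal sketch on a made-up frame (`df_demo` and the other names are illustrative), using the `columns=` and `index=` keywords as an alternative to `axis`:

```python
import pandas as pd

df_demo = pd.DataFrame({"x": [1, 2, 3, 4], "y": ["a", "b", "c", "d"]})

without_y = df_demo.drop(columns="y")       # same as df_demo.drop("y", axis=1)
without_rows = df_demo.drop(index=[1, 2])   # same as df_demo.drop([1, 2], axis=0)

print(df_demo.shape)                 # the original is unchanged
print(without_rows.index.tolist())   # remaining labels are not renumbered

# Renumber from 0 if a gap-free index is wanted
renumbered = without_rows.reset_index(drop=True)
print(renumbered.index.tolist())
```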


![image](images/15.png)

In [36]:
iris.min()

sepal.length       4.3
sepal.width        2.0
petal.length       1.0
petal.width        0.1
variety         Setosa
dtype: object

In [37]:
iris.max()

sepal.length          7.9
sepal.width           4.4
petal.length          6.9
petal.width           2.5
variety         Virginica
dtype: object

In [40]:
# iris.mean() on its own raises an error because the "variety" column holds non-numeric values.
# Tell pandas to use only the numeric columns when calculating the mean:
iris.mean(numeric_only = True)

sepal.length    5.843333
sepal.width     3.057333
petal.length    3.758000
petal.width     1.199333
dtype: float64

Alternative fix (manual selection):

If you want to be explicit, you can also select the numeric columns first:
```python
iris.select_dtypes(include='number').mean()
```

In [41]:
iris.median(numeric_only = True)

sepal.length    5.80
sepal.width     3.00
petal.length    4.35
petal.width     1.30
dtype: float64

![image](images/16.png)

In [43]:
def double(a):
    return a * 2

iris[["sepal.width", "petal.width"]].apply(double)

Unnamed: 0,sepal.width,petal.width
0,7.0,0.4
1,6.0,0.4
2,6.4,0.4
3,6.2,0.4
4,7.2,0.4
...,...,...
145,6.0,4.6
146,5.0,3.8
147,6.0,4.0
148,6.8,4.6
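
On a DataFrame, `apply` calls the function once per column by default and passes each column in as a Series, so `double` above is equivalent to multiplying the selection by 2 directly. A small sketch on made-up data (`df_demo` is hypothetical) that also shows `axis=1`, which passes one row at a time:

```python
import pandas as pd

df_demo = pd.DataFrame({"w": [1.0, 2.0, 3.0], "h": [4.0, 5.0, 6.0]})

def double(a):
    return a * 2

applied = df_demo.apply(double)   # double() receives each column as a Series
direct = df_demo * 2              # same result without apply
print(applied.equals(direct))

# axis=1 passes each row instead, which suits calculations across columns
row_sum = df_demo.apply(lambda row: row["w"] + row["h"], axis=1)
print(row_sum)
```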


![image](images/17.png)

In [44]:
iris["variety"].value_counts()

variety
Setosa        50
Versicolor    50
Virginica     50
Name: count, dtype: int64

In [47]:
iris.sort_values(by = "petal.width") # Sort the rows by the petal.width column (ascending by default)

Unnamed: 0,sepal.length,sepal.width,petal.length,petal.width,variety
32,5.2,4.1,1.5,0.1,Setosa
13,4.3,3.0,1.1,0.1,Setosa
37,4.9,3.6,1.4,0.1,Setosa
9,4.9,3.1,1.5,0.1,Setosa
12,4.8,3.0,1.4,0.1,Setosa
...,...,...,...,...,...
140,6.7,3.1,5.6,2.4,Virginica
114,5.8,2.8,5.1,2.4,Virginica
100,6.3,3.3,6.0,2.5,Virginica
144,6.7,3.3,5.7,2.5,Virginica
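
`sort_values` also accepts a sort direction and more than one column. A brief sketch on a made-up frame (`df_demo` and its columns are hypothetical) using the `ascending` parameter:

```python
import pandas as pd

df_demo = pd.DataFrame({"group": ["b", "a", "b", "a"], "score": [3, 1, 2, 4]})

# Largest score first
desc = df_demo.sort_values(by="score", ascending=False)
print(desc)

# Sort by group A-Z, then by score from high to low within each group
multi = df_demo.sort_values(by=["group", "score"], ascending=[True, False])
print(multi)
```

As in the iris output above, the original row labels travel with the rows, so the index appears shuffled after sorting.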


In [6]:
df = pd.read_csv("datasets/diamonds.csv")
df.head(2)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31


We have an "Unnamed: 0" column, which is the original index written out when the CSV was saved, and pandas has also created a new default index (0, 1, ...) on the left side.

To fix this, use `index_col=0` when reading the CSV:
```python
df = pd.read_csv("datasets/diamonds.csv", index_col=0)
df.head()
```
This will:

- Use the "Unnamed: 0" column as the actual index
- Remove the "Unnamed: 0" column from appearing as a data column
- Keep only one index column instead of having both the default index and the "Unnamed: 0" column

In [4]:
# Check the first few lines of the raw CSV
with open("datasets/diamonds.csv", 'r') as f:
    for i in range(5):
        print(f"Line {i}: {f.readline().strip()}")

Line 0: "","carat","cut","color","clarity","depth","table","price","x","y","z"
Line 1: "1",0.23,"Ideal","E","SI2",61.5,55,326,3.95,3.98,2.43
Line 2: "2",0.21,"Premium","E","SI1",59.8,61,326,3.89,3.84,2.31
Line 3: "3",0.23,"Good","E","VS1",56.9,65,327,4.05,4.07,2.31
Line 4: "4",0.29,"Premium","I","VS2",62.4,58,334,4.2,4.23,2.63


There are three ways to get rid of the unwanted extra index column:

In [7]:
# Option 1: Use the first column as the index when reading
df = pd.read_csv("datasets/diamonds.csv", index_col=0) # I used this one
df.head(2)

Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31


In [None]:
# Option 2: Drop the unwanted index column after reading
df = pd.read_csv("datasets/diamonds.csv")
df = df.drop(df.columns[0], axis=1)  # drops the first column ("Unnamed: 0")
df.head(2)

In [None]:
# Option 3: Reset the index if you want a clean integer index
# if you want a clean integer index starting from 0 and remove that "Unnamed: 0" column entirely
df = pd.read_csv("datasets/diamonds.csv", index_col=0)
df = df.reset_index(drop=True)  # discard the old index and renumber from 0
df.head(2)

The most common solution is **Option 1** using `index_col=0`, which tells pandas to use the first column as the index rather than creating a new default integer index alongside it.
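
The three options leave the same data columns and differ only in the index they produce. The sketch below checks this on an in-memory CSV, assuming an abbreviated copy of the raw lines printed earlier (only three columns kept; `csv_text` and the `opt*` names are illustrative):

```python
import io
import pandas as pd

# Abbreviated stand-in for the first lines of datasets/diamonds.csv
csv_text = '"","carat","cut"\n"1",0.23,"Ideal"\n"2",0.21,"Premium"\n'

# Option 1: the first column becomes the index (labels 1, 2)
opt1 = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: read normally, then drop the first column (default index 0, 1)
opt2 = pd.read_csv(io.StringIO(csv_text))
opt2 = opt2.drop(opt2.columns[0], axis=1)

# Option 3: use the first column as the index, then discard it (index 0, 1)
opt3 = pd.read_csv(io.StringIO(csv_text), index_col=0).reset_index(drop=True)

for name, frame in [("opt1", opt1), ("opt2", opt2), ("opt3", opt3)]:
    print(name, frame.columns.tolist(), frame.index.tolist())
```

Option 1 keeps the 1-based numbering stored in the file, while Options 2 and 3 end up with a fresh 0-based index.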

You can get a summary of a pandas dataset using several methods, depending on what type of summary you need:

#### Basic Dataset Information

In [None]:
# Get basic info about the dataset
df.info()  # Shows data types, non-null counts, memory usage

# Shape of the dataset
df.shape  # Returns (rows, columns)

# Column names and data types
df.dtypes

#### Statistical Summary

In [None]:
# Statistical summary for numerical columns
df.describe()  # Mean, std, min, max, quartiles

# Include categorical columns too
df.describe(include='all')

# Summary statistics for specific columns
df['column_name'].describe()

#### Quick Overview

In [None]:
# First/last few rows
df.head()     # First 5 rows (default)
df.tail(10)   # Last 10 rows

# Random sample
df.sample(5)  # 5 random rows

#### Missing Data Summary

In [None]:
# Count missing values
df.isnull().sum()

# Percentage of missing values
df.isnull().sum() / len(df) * 100

# Total missing values
df.isnull().sum().sum()
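
To see what these three expressions return, here is a small sketch on a made-up frame with a few gaps (`df_demo` is hypothetical). It uses `isnull().mean()` as a shorter way to get the fraction of missing values per column:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": ["x", None, "z", "w"],
})

missing = df_demo.isnull().sum()             # missing values per column
missing_pct = df_demo.isnull().mean() * 100  # same as sum() / len(df) * 100
total_missing = int(missing.sum())           # missing values in the whole frame

print(missing)
print(missing_pct)
print(total_missing)
```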

#### Value Counts and Unique Values

In [None]:
# For categorical columns
df['column_name'].value_counts()

# Unique values in each column
df.nunique()

# See unique values
df['column_name'].unique()

#### Memory Usage

In [None]:
# Memory usage by column
df.memory_usage(deep=True)

# Total memory usage
df.memory_usage(deep=True).sum()

For a comprehensive overview, you might combine several of these:

In [None]:
print("Dataset Shape:", df.shape)
print("\nData Types:")
print(df.dtypes)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nStatistical Summary:")
print(df.describe())

The `describe()` method is usually the most useful starting point as it gives you key statistics for all numerical columns at once.

For deeper insight, the code below prints a more comprehensive overview.

In [None]:
print("="*50)
print("COMPREHENSIVE DATASET OVERVIEW")
print("="*50)

# Basic Info
print("Dataset Shape:", df.shape)
print(f"Total cells: {df.size:,}")
print(f"Total memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

print("\n" + "="*30)
print("DATA TYPES & STRUCTURE")
print("="*30)
print(df.dtypes)
print("\nData type distribution:")
print(df.dtypes.value_counts())

# Detailed info
print("\n" + "="*30)
print("DETAILED COLUMN INFO")
print("="*30)
df.info()

print("\n" + "="*30)
print("MISSING VALUES ANALYSIS")
print("="*30)
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Missing %': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0])  # Only show columns with missing values

print("\n" + "="*30)
print("UNIQUE VALUES COUNT")
print("="*30)
print(df.nunique().sort_values(ascending=False))

print("\n" + "="*30)
print("STATISTICAL SUMMARY")
print("="*30)
print(df.describe(include='all'))

# Additional insights for categorical columns
print("\n" + "="*30)
print("CATEGORICAL COLUMNS PREVIEW")
print("="*30)
categorical_cols = df.select_dtypes(include=['object']).columns
for col in categorical_cols[:5]:  # Show first 5 categorical columns
    print(f"\n{col}:")
    print(df[col].value_counts().head())

# Numerical columns distribution
print("\n" + "="*30)
print("NUMERICAL COLUMNS DISTRIBUTION")
print("="*30)
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
for col in numerical_cols:
    print(f"\n{col}: Min={df[col].min()}, Max={df[col].max()}, "
          f"Median={df[col].median()}, Skew={df[col].skew():.2f}")

print("\n" + "="*30)
print("DUPLICATE ROWS")
print("="*30)
duplicates = df.duplicated().sum()
print(f"Total duplicate rows: {duplicates}")
if duplicates > 0:
    print(f"Percentage of duplicates: {(duplicates/len(df))*100:.2f}%")

You can see all column names in a pandas DataFrame using several methods:

### Most Common Methods

#### 1. Using .columns attribute:

In [None]:
df.columns # This returns an Index object with all column names.

#### 2. Convert to list:

In [None]:
df.columns.tolist()   # This gives you a regular Python list of column names.

# or

list(df.columns)

#### 3. Print all columns nicely:

In [None]:
for col in df.columns:
    print(col)

#### 4. For Large DataFrames

In [None]:
# Show every column when displaying a wide DataFrame (no truncation)
pd.set_option('display.max_columns', None)
# Print all column names as a plain list
print(df.columns.tolist())

The .columns attribute is usually the quickest way to get what you need!