# **What Is Pandas and Why Use It?**
Pandas is a Python library for data manipulation and analysis. Its two core data structures, Series and DataFrame, make it straightforward to work with tabular data such as spreadsheets or SQL tables. Whether your structured data comes from a CSV file, an Excel sheet, or a database, pandas provides tools to clean, explore, and process it.

In [None]:
import pandas as pd
import numpy as np

#### **Series**
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).

In [None]:
data1 = [5, 4]
index = ['a', 'b']
# data and index must have the same length; without an index, pandas assigns default integer labels (0, 1, 2, ...)
s = pd.Series(data1, index=index)
# when building from a dict, the keys ('a' and 'b') become the index
data2 = {'a': 1, 'b': 2}
# if an index is also given, values are looked up by label; labels missing from the dict ('c', 'd') become NaN
s = pd.Series(data2, index=['b', 'c', 'd', 'a'])
print(s.shape) # (length,)
print(s.head(2)) # first 2 rows
print(s.tail(2)) # last 2 rows
s.info() # prints a summary of the Series (dtype, non-null count, memory usage) and returns None; needs pandas 1.4+
print(s.describe()) # summary statistics (count, mean, std, min, quartiles, max)
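
The label lookup described above can be checked directly. This is a minimal sketch that reuses the dict from the cell above; the name `missing_labels` is introduced here only for illustration.

```python
import pandas as pd

data2 = {'a': 1, 'b': 2}
# Values are looked up by label; 'c' and 'd' are not keys of the dict, so they become NaN
s = pd.Series(data2, index=['b', 'c', 'd', 'a'])

# NaN is a float, so the Series is stored as float64 even though the dict held integers
print(s.dtype)

# Collect the labels that had no matching key in the dict
missing_labels = s[s.isna()].index.tolist()
print(missing_labels)  # ['c', 'd']
```

If the float conversion is unwanted, one option is to fill or drop the missing labels before converting back to an integer type.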

#### **DataFrames**
A DataFrame is a two-dimensional labeled data structure whose columns can have different types. You can think of it as a spreadsheet, a SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

In [None]:
# shorter rows are padded explicitly with np.nan so that every row has four values
df = pd.DataFrame([[1, 2, 3, np.nan], [4, 5, 6, 10], [7, 9, np.nan, np.nan]],
                  columns=['a', 'b', 'c', 'z'],
                  index=['d', 'e', 'f'])
df.head(1) # first row
df.tail(2) # last 2 rows
print(df.shape) # (rows, columns)
print(df.size) # total number of elements (rows * columns)
df.info() # prints a summary of the DataFrame (dtypes, non-null counts, memory usage) and returns None
print(df.describe()) # summary statistics of the numeric columns
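
The "dict of Series" view mentioned above can be illustrated with a short sketch. The Series names and values (`heights`, `ages`, `people`) are made up for this example; the point is that pandas aligns the Series on their labels and fills gaps with NaN.

```python
import pandas as pd

# Two Series with overlapping, but not identical, labels (illustrative data)
heights = pd.Series([1.70, 1.82], index=['ann', 'bob'])
ages = pd.Series([31, 45, 27], index=['ann', 'bob', 'cy'])

# Each dict key becomes a column; rows are the union of the Series labels
people = pd.DataFrame({'height': heights, 'age': ages})
print(people)

# 'cy' has an age but no height, so that cell is NaN
print(people['height'].isna().sum())  # 1
```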

#### **Handling Missing Values (NaN) in Pandas**
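Pandas provides several ways to deal with missing values (NaN): replace them with a fixed value, fill them from neighbouring values, or drop the affected rows or columns.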

In [None]:
# Replacing Missing Values with a Specific Value (fillna)
df_filled = df.fillna({'z': 5}) # replaces NaN values only in the column 'z' with 5.

# Filling Missing Values Using Forward Fill (ffill)
# ffill (forward fill): fills each missing value with the last valid value before it
# axis=0: fills down each column; axis=1: fills across each row
# limit=1: fills at most one consecutive NaN after each valid value
# (fillna(method='ffill') is deprecated since pandas 2.1; df.ffill() is the replacement)
df_ffill = df.ffill(axis=0, limit=1)

# Filling Missing Values Using Backward Fill (bfill)
# bfill (backward fill): fills each missing value with the next valid value below it
# if there is no valid value below (e.g., in the last row), the missing value remains NaN
df_bfill = df.bfill()

# Removing Rows or Columns with Missing Values (dropna)
# this removes rows where any value is NaN.
# to drop columns instead, use df.dropna(axis=1).
df_cleaned = df.dropna()
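
Before choosing a strategy it helps to count the missing values, and to see what `limit` does when several NaNs occur in a row. A small sketch on a throwaway Series (the data is invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

# isna() marks missing entries; summing the booleans counts them
print(s.isna().sum())  # 2

# limit=1 fills only the first NaN of each consecutive run
filled = s.ffill(limit=1)
print(filled.tolist())  # [1.0, 1.0, nan, 4.0]
```

The same `isna().sum()` call on a DataFrame returns one count per column.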


#### **Data Selection and Manipulation**

In [None]:
df = pd.DataFrame([[1, 2, 7], [8, 5, 9], [4, 3, 6]],  
                  columns=['a', 'b', 'c'],  
                  index=['d', 'e', 'f'])  
df['a'] # select a single column by name (returns a Series)
df[['a', 'c']] # pass a list of column names to select several columns (returns a DataFrame)

# DF.LOC: access a group of rows and columns by label(s) or a boolean array
df.loc[['d']] # a list with a single label (returns a DataFrame; df.loc['d'] returns a Series)
df.loc[['d', 'e']] # a list or array of labels
df.loc['d':'f'] # a slice of labels (both endpoints are included)
df.loc[[True, False, True]] # a boolean array of the same length as the axis
df.loc[df['a'] % 2 == 0] # a condition that yields a boolean Series (rows where 'a' is even)
df.loc[df['a'] > 2] # rows where 'a' is greater than 2

# DF.ILOC: access rows and columns by integer position
df.iloc[1] # the row at position 1, returned as a Series
df.iloc[[0, 1, 2]] # a list or array of integer positions
df.iloc[0:3] # a slice of integer positions (the end position is excluded)
df.iloc[[True, False, True]] # a boolean array of the same length as the axis

# ADDING NEW COLUMNS
df['z'] = [10, 11, 12]
df['x'] = df['b'] + df['c']

# REMOVING COLUMNS AND ROWS
df = df.drop(['x'], axis=1) # pass a list of labels to delete; axis=0 or 'index' drops rows (default), axis=1 or 'columns' drops columns

# SORTING DATA
df.sort_values(axis=1, by='e') # reorders the columns by the values in row 'e'; with axis=0 (default), by names a column and the rows are reordered
df['y'] = df['a'] + df['b']
df.sort_index(axis=1)

# RESETTING AND SETTING INDEX
df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})
df.set_index('month') # Set the index to become the ‘month’ column
df.set_index(['year', 'month']) # Create a MultiIndex using columns ‘year’ and ‘month’
df.set_index([pd.Index([1, 2, 3, 4]), 'month']) # Create a MultiIndex using an Index and a column
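
The cell above covers setting an index but not resetting one. A minimal sketch of `reset_index`, reusing the same sales data; the variable names `indexed`, `restored`, and `dropped` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'month': [1, 4, 7, 10],
                   'year': [2012, 2014, 2013, 2014],
                   'sale': [55, 40, 84, 31]})

# set_index returns a new DataFrame; df itself still has 'month' as a column
indexed = df.set_index('month')

# reset_index moves the index back into a regular column and restores 0..n-1 labels
restored = indexed.reset_index()

# drop=True discards the old index instead of keeping it as a column
dropped = indexed.reset_index(drop=True)

print(restored.columns.tolist())  # ['month', 'year', 'sale']
print(dropped.columns.tolist())   # ['year', 'sale']
```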


#### **Data Cleaning & Preprocessing**

In [None]:
df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))
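# linear interpolation down each column; limit_direction='both' also fills NaNs at the start and end of a column,
# whereas 'forward' leaves leading NaNs unfilled and 'backward' leaves trailing NaNs unfilled
df.interpolate(axis=0, limit_direction='both')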

# HANDLING DUPLICATES
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 5, 5]
})
df.duplicated() # boolean Series marking rows that repeat an earlier row
df.drop_duplicates() # drop fully duplicated rows, keeping the first occurrence
df.drop_duplicates(subset=['brand']) # treat rows as duplicates when the listed columns match

# CHANGING DATATYPES
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df.astype('int32').dtypes # all columns to int32
df.astype({'col1': 'int32'}).dtypes # col1 to int32 using a dictionary

# RENAMING COLUMNS
df.rename(columns={'col1': 'A', 'col2': 'B'}) # rename column labels (returns a new DataFrame)
df.rename(index={0: 'A', 1: 'B'}) # rename row labels

# REPLACING VALUES
df.replace([1,2,3],5) # replaces all the given values in the list with 5
df.replace([1,2], [6,7]) # Replace 1 with 6, 2 with 7

# APPLYING FUNCTIONS
df = df.replace([1,2,3,4], [1,4,9,16])
df.apply(np.sqrt) # Apply square root to all values
df.apply(np.sum, axis=1) # Sum rows
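
For the `interpolate` call at the top of this cell, the effect of `limit_direction` is easiest to see on a single Series with NaNs at both ends. A small sketch with invented data, using the default linear method:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# 'forward' fills gaps after a valid value; the leading NaN stays
fwd = s.interpolate(limit_direction='forward')

# 'backward' fills gaps before a valid value; the trailing NaN stays
bwd = s.interpolate(limit_direction='backward')

# 'both' fills in either direction, so no NaN remains here
both = s.interpolate(limit_direction='both')

print(fwd.isna().tolist())
print(bwd.isna().tolist())
print(both.tolist())  # [1.0, 1.0, 2.0, 3.0, 3.0]
```

Note that values at the edges are not extrapolated along the trend; they are filled with the nearest valid value.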

#### **Grouping & Aggregation**

In [None]:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                              'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})
df.groupby('Animal').mean() # group by category and calculate the mean
df.groupby('Animal').sum() # group by category and calculate the sum
df.groupby('Animal').count() # group by category and count non-null values
df.agg(['sum', 'max']) # apply several aggregations to the whole DataFrame, without grouping
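
Grouping and multiple aggregations can also be combined in one call. A short sketch on the same animal data; selecting the `'Max Speed'` column first keeps the result to numeric values, and the name `summary` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
                   'Max Speed': [380., 370., 24., 26.]})

# One row per animal, one column per aggregation
summary = df.groupby('Animal')['Max Speed'].agg(['mean', 'max'])
print(summary)

print(summary.loc['Falcon', 'mean'])  # 375.0
print(summary.loc['Parrot', 'max'])   # 26.0
```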

#### **Merging and Combining Dataframes**

In [None]:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])
pd.concat([df1, df2]) # Stack DataFrames on top of each other

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
df1.merge(df2, left_on='lkey', right_on='rkey') # inner join where lkey == rkey; the shared 'value' column is returned as value_x and value_y

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K1', 'K2', 'K3'])
df2 = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=['K1', 'K2', 'K4'])

result = df1.join(df2)  # default: LEFT JOIN on index; K3 has no match in df2, so its B value is NaN
result
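
The `how` parameter of `join` controls which index labels survive. A minimal sketch comparing the default left join with inner and outer joins on the same two frames; the names `left`, `inner`, and `outer` are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K1', 'K2', 'K3'])
df2 = pd.DataFrame({'B': ['B1', 'B2', 'B3']}, index=['K1', 'K2', 'K4'])

# left (default): keep every label of df1; K3 gets NaN in column B
left = df1.join(df2)

# inner: keep only labels present in both frames
inner = df1.join(df2, how='inner')

# outer: keep the union of labels; gaps on either side become NaN
outer = df1.join(df2, how='outer')

print(len(left), len(inner), len(outer))  # 3 2 4
```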