<div style="background-color: rgba(247, 200, 115, 0.3); padding: 30px 0;">
    <div style="max-width: 800px; margin: 0 auto; text-align: center;">
        <h1 style="font-size: 48px; color: #cc7a00; margin-bottom: 10px;">🚀 Machine Learning 📊</h1>
        <h3 style="font-size: 28px; color: #cc7a00; margin-bottom: 10px;">Numpy & Pandas</h3>
        <h4 style="font-size: 18px; color: #cc7a00;"><a href="https://www.linkedin.com/in/mohammadreza-qaderi/" style="color: #1e90ff; text-decoration: none;">MohammadReza Qaderi</a></h4>
        <h4 style="font-size: 18px; color: #cc7a00;"><a href="https://github.com/MR-Qaderi/MachineLearningCourseMaterials" style="color: #1e90ff; text-decoration: none;">GitHub Repository</a></h4>
    </div>
</div>


<div style="background-color: #f9f9f9; border: 1px solid #ccc; padding: 10px; border-radius: 10px; font-family: Arial;">

<h1 style="color: #333; text-align: center;">Importance of Numpy and Pandas in Machine Learning</h1>

<p>In this session, we delved into the fundamental concepts of two essential libraries for machine learning: <strong>Numpy</strong> and <strong>Pandas</strong>.</p>

<h2>Numpy:</h2>

<p>NumPy, short for Numerical Python, is the backbone of many numerical computing and data science libraries in Python. It provides n-dimensional arrays and a collection of mathematical functions that operate on them. Because these operations run in compiled code rather than in Python loops, NumPy handles large datasets efficiently and forms the foundation for many machine learning algorithms.</p>

<h2>Pandas:</h2>

<p>Pandas is a versatile library that simplifies data manipulation and analysis in Python. It introduces the DataFrame, a powerful data structure that allows for efficient storage, retrieval, and manipulation of data. With Pandas, we can easily clean, transform, and prepare our data for machine learning tasks, making it an invaluable tool in the data preprocessing pipeline.</p>

<p>Both Numpy and Pandas are considered essential skills for any aspiring data scientist or machine learning practitioner. They provide the necessary tools to efficiently manage and process data, setting the stage for more advanced machine learning techniques. Throughout this notebook, we'll continue to leverage the capabilities of these libraries to build robust machine learning models.</p>

</div>

<h1 style = "font-size:3rem;color:orange;"> Numpy</h1>

### Installing numpy

In [55]:
# !pip install numpy

website: https://numpy.org/

## Why numpy?

NumPy (sometimes written Numpy) is a numerical computing and linear algebra library for Python. It matters so much for data science because almost all of the libraries in the PyData ecosystem rely on NumPy as one of their main building blocks.
NumPy is also fast: its core array operations are implemented in compiled C code rather than in Python loops.
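
To make the speed claim concrete, here is a minimal sketch comparing a Python-level loop with the equivalent vectorized expression. The sample data and variable names are made up for illustration, and no timings are shown because they depend on the machine.

```python
import numpy as np

data = list(range(10_000))            # hypothetical sample data as a plain list
arr = np.array(data, dtype=np.int64)  # the same values as a NumPy array

# Python-level loop: the interpreter processes one element at a time
squares_loop = [v * v for v in data]

# Vectorized: one expression, with the element-wise loop running in compiled code
squares_vec = arr * arr

# Both approaches produce the same values
same = np.array_equal(squares_vec, np.array(squares_loop, dtype=np.int64))
print(same)
```

In a notebook, `%timeit` on each of the two expressions is a simple way to measure the difference on your own machine.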

## Import

In [2]:
import numpy as np

In [4]:
np.__version__

'1.21.5'

## Arrays

In [5]:
a = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]])
print(a)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


<img src = "https://media.geeksforgeeks.org/wp-content/uploads/Numpy1.jpg" style="width: 70%">

In [6]:
print(np.where(a > 5, a, 0))

[[0 0 0]
 [0 0 6]
 [7 8 9]]


<img src = "https://numpy.org/devdocs/_images/np_indexing.png" style="width: 100%">

In [7]:
a[2,1]

8

In [9]:
a.shape

(3, 3)

In [10]:
print(a.shape)

(3, 3)


In [13]:
b = np.array([[1, 2, 3], [4, 5, 6]])

In [14]:
b

array([[1, 2, 3],
       [4, 5, 6]])

In [15]:
b.shape

(2, 3)

In [18]:
print(np.reshape(b, (3, 2)))

[[1 2]
 [3 4]
 [5 6]]


In [19]:
print(np.reshape(b, (1, -1)))

[[1 2 3 4 5 6]]


In [20]:
print(np.reshape(a, (-1, 1)))

[[1]
 [2]
 [3]
 [4]
 [5]
 [6]
 [7]
 [8]
 [9]]
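
The `-1` used in the reshape calls above tells NumPy to infer that dimension from the array's total size. A small sketch on a made-up six-element array:

```python
import numpy as np

values = np.arange(1, 7)            # six elements: 1..6

as_rows = values.reshape(2, -1)     # -1 is inferred as 3
as_column = values.reshape(-1, 1)   # -1 is inferred as 6

print(as_rows.shape)    # (2, 3)
print(as_column.shape)  # (6, 1)
```

Only one dimension can be `-1`, and the remaining dimensions must divide the total number of elements evenly.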


<img src="https://fgnt.github.io/python_crashkurs_doc/_images/numpy_array_t.png" width="50%"/>

In [21]:
print(a.ndim)

2


In [22]:
print(a.dtype)

int32


## arange

In [23]:
c = np.arange(1, 30, step = 3)
print(c)

[ 1  4  7 10 13 16 19 22 25 28]


## linspace

In [24]:
d = np.linspace(1, 2, 5)
print(d)

[1.   1.25 1.5  1.75 2.  ]


## specific arrays

In [25]:
print(np.ones(shape=(3, 2)))

[[1. 1.]
 [1. 1.]
 [1. 1.]]


In [26]:
print(np.zeros(shape=(2, 3)))

[[0. 0. 0.]
 [0. 0. 0.]]


In [28]:
print(5. * np.ones(shape=(3, 2)))

[[5. 5.]
 [5. 5.]
 [5. 5.]]


In [27]:
print(np.full((3, 2), 5))

[[5 5]
 [5 5]
 [5 5]]


In [28]:
print(np.eye(4))

[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]


In [29]:
print(np.random.rand(3, 2))

[[0.90059021 0.23836418]
 [0.64051753 0.41002688]
 [0.08950529 0.09139838]]


## Selection

In [30]:
T = np.arange(1,11)

In [32]:
T

array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])

In [33]:
T > 4

array([False, False, False, False,  True,  True,  True,  True,  True,
        True])

In [35]:
T2 = T[T > 4]
T2

array([ 5,  6,  7,  8,  9, 10])
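
Boolean masks can also be combined. The sketch below uses a condition chosen purely for illustration; note that NumPy needs the element-wise operators `&` and `|` with parentheses around each comparison, not Python's `and` / `or`.

```python
import numpy as np

T = np.arange(1, 11)

# Keep values that are greater than 4 AND even
mask = (T > 4) & (T % 2 == 0)
selected = T[mask]
print(selected)   # [ 6  8 10]
```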

## Operations

In [36]:
x = np.random.rand(4,3)
y = np.random.rand(4,3)

In [37]:
print(x)
print()
print(y)

[[0.68966522 0.39496837 0.77639705]
 [0.42209854 0.9514051  0.67850768]
 [0.30769158 0.57857793 0.25826052]
 [0.76083852 0.47288649 0.22019645]]

[[0.82641676 0.53563443 0.86595174]
 [0.28841206 0.46571789 0.44606197]
 [0.76965962 0.2412172  0.61322125]
 [0.38163945 0.35941376 0.74852153]]


In [38]:
print(x + y)
print()
print(np.add(x, y))

[[1.51608199 0.9306028  1.64234878]
 [0.7105106  1.41712299 1.12456965]
 [1.07735121 0.81979513 0.87148177]
 [1.14247797 0.83230025 0.96871798]]

[[1.51608199 0.9306028  1.64234878]
 [0.7105106  1.41712299 1.12456965]
 [1.07735121 0.81979513 0.87148177]
 [1.14247797 0.83230025 0.96871798]]


In [39]:
print(x - y)
print()
print(np.subtract(x, y))

[[-0.53862795  0.21759763 -0.06693401]
 [ 0.27992044 -0.4165199  -0.81726927]
 [ 0.4988415   0.29885633  0.36709308]
 [-0.15088003 -0.11441511  0.8469572 ]]

[[-0.53862795  0.21759763 -0.06693401]
 [ 0.27992044 -0.4165199  -0.81726927]
 [ 0.4988415   0.29885633  0.36709308]
 [-0.15088003 -0.11441511  0.8469572 ]]


In [39]:
print(x * y)
print()
print(np.multiply(x, y))

[[0.5699509  0.21155866 0.67232237]
 [0.12173831 0.44308637 0.30265647]
 [0.23681779 0.13956295 0.15837084]
 [0.29036599 0.16996191 0.16482178]]

[[0.5699509  0.21155866 0.67232237]
 [0.12173831 0.44308637 0.30265647]
 [0.23681779 0.13956295 0.15837084]
 [0.29036599 0.16996191 0.16482178]]


In [40]:
print(x / y)
print()
print(np.divide(x, y))

[[0.83452473 0.73738422 0.89658235]
 [1.46352597 2.04287859 1.52110632]
 [0.39977618 2.39857661 0.42115389]
 [1.99360555 1.31571616 0.29417517]]

[[0.83452473 0.73738422 0.89658235]
 [1.46352597 2.04287859 1.52110632]
 [0.39977618 2.39857661 0.42115389]
 [1.99360555 1.31571616 0.29417517]]


In [42]:
print(np.sqrt(x))

[[0.83046085 0.62846509 0.88113395]
 [0.64969111 0.97539997 0.82371578]
 [0.55469954 0.7606431  0.50819338]
 [0.87226058 0.68766743 0.46925095]]


In [44]:
z = np.concatenate((x, y))
z

array([[0.68966522, 0.39496837, 0.77639705],
       [0.42209854, 0.9514051 , 0.67850768],
       [0.30769158, 0.57857793, 0.25826052],
       [0.76083852, 0.47288649, 0.22019645],
       [0.82641676, 0.53563443, 0.86595174],
       [0.28841206, 0.46571789, 0.44606197],
       [0.76965962, 0.2412172 , 0.61322125],
       [0.38163945, 0.35941376, 0.74852153]])

<img src="https://numpy.org/devdocs/_images/np_aggregation.png" /> 

In [45]:
z.sum()

13.053361107310119

In [46]:
z.max()

0.9514051008282509

<img src="https://numpy.org/devdocs/_images/np_matrix_aggregation_row.png" /> 
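
Aggregations such as `sum` and `max` can also run along a single axis instead of over the whole array. A minimal sketch on a small made-up matrix:

```python
import numpy as np

m = np.array([[1, 2],
              [5, 3],
              [4, 6]])

total = m.sum()          # over all elements: 21
col_max = m.max(axis=0)  # one value per column: [5 6]
row_max = m.max(axis=1)  # one value per row: [2 5 6]
row_sum = m.sum(axis=1)  # one value per row: [ 3  8 10]

print(total, col_max, row_max, row_sum)
```

`axis=0` collapses the rows (one result per column), while `axis=1` collapses the columns (one result per row).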

In [47]:
# Creating two 2D NumPy arrays (3x4)
array1 = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

array2 = np.array([[13, 14, 15, 16],
                   [17, 18, 19, 20],
                   [21, 22, 23, 24]])

In [48]:
# Concatenating along axis=1 (columns)
concatenated_columns = np.concatenate((array1, array2), axis=1)

print("Array1:")
print(array1)
print("Array2:")
print(array2)
print("Concatenated along columns:")
print(concatenated_columns)


Array1:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]
Array2:
[[13 14 15 16]
 [17 18 19 20]
 [21 22 23 24]]
Concatenated along columns:
[[ 1  2  3  4 13 14 15 16]
 [ 5  6  7  8 17 18 19 20]
 [ 9 10 11 12 21 22 23 24]]


In [49]:
# Transposing arrays to demonstrate axis=0
transposed_array1 = array1.T
transposed_array2 = array2.T

# Concatenating along axis=0 (rows)
concatenated_rows = np.concatenate((transposed_array1, transposed_array2), axis=0)

print("Transposed Array1:")
print(transposed_array1)
print("Transposed Array2:")
print(transposed_array2)
print("Concatenated along rows:")
print(concatenated_rows)

Transposed Array1:
[[ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]
 [ 4  8 12]]
Transposed Array2:
[[13 17 21]
 [14 18 22]
 [15 19 23]
 [16 20 24]]
Concatenated along rows:
[[ 1  5  9]
 [ 2  6 10]
 [ 3  7 11]
 [ 4  8 12]
 [13 17 21]
 [14 18 22]
 [15 19 23]
 [16 20 24]]
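
For 2D arrays, `np.vstack` and `np.hstack` give the same results as concatenating along `axis=0` and `axis=1`. A short sketch with two made-up arrays:

```python
import numpy as np

left = np.array([[1, 2], [3, 4]])
right = np.array([[5, 6], [7, 8]])

rows = np.concatenate((left, right), axis=0)  # stack vertically, shape (4, 2)
cols = np.concatenate((left, right), axis=1)  # stack horizontally, shape (2, 4)

# For 2D inputs these helpers are equivalent shortcuts
same_rows = np.array_equal(rows, np.vstack((left, right)))
same_cols = np.array_equal(cols, np.hstack((left, right)))
print(rows.shape, cols.shape, same_rows, same_cols)
```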


<h1 style = "font-size:3rem;color:orange;"> Pandas</h1>

In [2]:
import pandas as pd

## Data Structures in Pandas

In [51]:
# Series
data_series = pd.Series([10, 20, 30, 40, 50])
data_series

0    10
1    20
2    30
3    40
4    50
dtype: int64

In [52]:
# DataFrame
data_dict = {'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]}
data_df = pd.DataFrame(data_dict)
data_df

    Name  Age
0   John   25
1  Alice   30
2    Bob   22


## Reading and Writing Data

### Uploading and Reading Data in Jupyter Notebook

The line `df = pd.read_csv('diabetes1.csv')` demonstrates how to read data in a Jupyter Notebook.

1. **Uploading Data**: In a Jupyter Notebook, you can upload data files directly. Once uploaded, you can access them using their file names.

2. **Reading Data**: The `pd.read_csv(...)` function from the pandas library is used to read CSV files. In this case, it's reading a file named `'diabetes1.csv'`.

   - `df`: A conventional variable name for a DataFrame; it is short for 'data frame'.

By executing this line, the content of `'diabetes1.csv'` will be loaded into the DataFrame `df`, making it ready for analysis and manipulation within the Jupyter Notebook environment.
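
If you want to try `pd.read_csv` without a file on disk, it also accepts a file-like object. The sketch below feeds it a small in-memory CSV; the three rows are made up for illustration and are not the contents of `diabetes1.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV text; the empty BMI field in the last row becomes NaN
csv_text = """Glucose,BMI,Outcome
148,33.6,1
85,26.6,0
183,,1
"""

sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)            # (3, 3)
print(sample.isnull().sum())   # missing values per column
```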


In [3]:
# Reading CSV
df = pd.read_csv('diabetes1.csv')
df

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0              6      148             72             35        0  33.6                     0.627   50        1
1              1       85             66             29        0  26.6                     0.351   31        0
2              8      183             64              0        0  23.3                     0.672   32        1
3              1       89             66             23       94  28.1                     0.167   21        0
4              0      137             40             35      168  43.1                     2.288   33        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1


The line `titanic = pd.read_csv('D:/Machine Learning/Datasets/titanic.csv')` reads data from a CSV file located at the specified path (`'D:/Machine Learning/Datasets/titanic.csv'`) on the user's local machine.

In [55]:
# Reading CSV
titanic = pd.read_csv('D:/Machine Learning/Datasets/titanic.csv')
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.00,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.00,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.00,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.50,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.50,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.00,0,0,2670,7.2250,,C,,,


In [60]:
# Writing to CSV
data_df.to_csv('output.csv', index=False)

## Exploring Data

In [57]:
# Getting the first few rows
df.head(6)

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6    148.0           72.0           35.0      NaN  33.6                     0.627   50        1
1            1     85.0           66.0           29.0      NaN  26.6                     0.351   31        0
2            8    183.0           64.0            NaN      NaN  23.3                     0.672   32        1
3            1     89.0           66.0           23.0     94.0  28.1                     0.167   21        0
4            0    137.0           40.0           35.0    168.0  43.1                     2.288   33        1
5            5    116.0           74.0            NaN      NaN  25.6                     0.201   30        0


In [59]:
df.tail(11)

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
757            0    123.0           72.0            NaN      NaN  36.3                     0.258   52        1
758            1    106.0           76.0            NaN      NaN  37.5                     0.197   26        0
759            6    190.0           92.0            NaN      NaN  35.5                     0.278   66        1
760            2     88.0           58.0           26.0     16.0  28.4                     0.766   22        0
761            9    170.0           74.0           31.0      NaN  44.0                     0.403   43        1
762            9     89.0           62.0            NaN      NaN  22.5                     0.142   33        0
763           10    101.0           76.0           48.0    180.0  32.9                     0.171   63        0
764            2    122.0           70.0           27.0      NaN  36.8                     0.340   27        0
765            5    121.0           72.0           23.0    112.0  26.2                     0.245   30        0
766            1    126.0           60.0            NaN      NaN  30.1                     0.349   47        1


In [60]:
type(df)

pandas.core.frame.DataFrame

In [61]:
# Basic information about the DataFrame
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB
None


In [62]:
# Summary statistics for numeric columns
df.describe()

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  763.000000     733.000000     541.000000  394.000000  757.000000                768.000000  768.000000  768.000000
mean      3.845052  121.686763      72.405184      29.153420  155.548223   32.457464                  0.471876   33.240885    0.348958
std       3.369578   30.535641      12.382158      10.476982  118.775855    6.924988                  0.331329   11.760232    0.476951
min       0.000000   44.000000      24.000000       7.000000   14.000000   18.200000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      64.000000      22.000000   76.250000   27.500000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      29.000000  125.000000   32.300000                  0.372500   29.000000    0.000000
75%       6.000000  141.000000      80.000000      36.000000  190.000000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000


In [60]:
# Count how often each unique value occurs in a column
print(titanic['embarked'].value_counts())

S    914
C    270
Q    123
Name: embarked, dtype: int64


In [67]:
pd.crosstab(titanic.sex, titanic.survived)

survived    0    1
sex
female    127  339
male      682  161


In [68]:
pd.crosstab(titanic.sex, titanic.survived, normalize = "index" )

survived         0         1
sex
female    0.272532  0.727468
male      0.809015  0.190985
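
With `normalize="index"`, each row of the crosstab sums to 1, so column `1` holds the survival rate per group. The sketch below checks this against a `groupby` mean on a tiny made-up table (not the Titanic data).

```python
import pandas as pd

# Hypothetical toy data: 2 women (1 survived), 3 men (1 survived)
toy = pd.DataFrame({
    'sex': ['female', 'female', 'male', 'male', 'male'],
    'survived': [1, 0, 0, 0, 1],
})

# Row-normalized crosstab: shares within each sex
rates = pd.crosstab(toy['sex'], toy['survived'], normalize='index')

# Because 'survived' is 0/1, its mean per group is the same survival rate
survival_rate = toy.groupby('sex')['survived'].mean()

print(rates)
print(survival_rate)
```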


## Data Cleaning and Handling Missing Data

### Dropping Missing Values

<font size = 4> When to Use: <font size = 3.5> Dropping missing values is suitable when the number of missing values is small, and removing them does not significantly impact the analysis or model.

In [63]:
# Remove rows with any missing value
df_cleaned = df.dropna()

In [70]:
# Remove columns with any missing value
# df_cleaned = df.dropna(axis=1)

In [64]:
df_cleaned

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
3              1     89.0           66.0           23.0     94.0  28.1                     0.167   21        0
4              0    137.0           40.0           35.0    168.0  43.1                     2.288   33        1
6              3     78.0           50.0           32.0     88.0  31.0                     0.248   26        1
8              2    197.0           70.0           45.0    543.0  30.5                     0.158   53        1
13             1    189.0           60.0           23.0    846.0  30.1                     0.398   59        1
..           ...      ...            ...            ...      ...   ...                       ...  ...      ...
753            0    181.0           88.0           44.0    510.0  43.3                     0.222   26        1
755            1    128.0           88.0           39.0    110.0  36.5                     1.057   37        1
760            2     88.0           58.0           26.0     16.0  28.4                     0.766   22        0
763           10    101.0           76.0           48.0    180.0  32.9                     0.171   63        0


In [65]:
df_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 3 to 765
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               392 non-null    int64  
 1   Glucose                   392 non-null    float64
 2   BloodPressure             392 non-null    float64
 3   SkinThickness             392 non-null    float64
 4   Insulin                   392 non-null    float64
 5   BMI                       392 non-null    float64
 6   DiabetesPedigreeFunction  392 non-null    float64
 7   Age                       392 non-null    int64  
 8   Outcome                   392 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 30.6 KB


In [66]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   763 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       757 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB


In [69]:
df.isnull().sum()

Pregnancies                   0
Glucose                       5
BloodPressure                35
SkinThickness               227
Insulin                     374
BMI                          11
DiabetesPedigreeFunction      0
Age                           0
Outcome                       0
dtype: int64

In [70]:
df_cleaned.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
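
Before dropping rows, it can help to check how much data would be lost, since `dropna()` removes every row that has a gap in any column. The sketch below uses a made-up four-row table and also shows `subset`, which restricts the check to chosen columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two columns
toy = pd.DataFrame({
    'Glucose': [148.0, 85.0, np.nan, 89.0],
    'Insulin': [np.nan, np.nan, np.nan, 94.0],
    'Age': [50, 31, 32, 21],
})

# Fraction of missing values per column
missing_share = toy.isnull().mean()
print(missing_share)

# Dropping on all columns keeps only fully complete rows
rows_all = toy.dropna()
# Dropping only where 'Glucose' is missing keeps more data
rows_subset = toy.dropna(subset=['Glucose'])

print(len(toy), len(rows_all), len(rows_subset))   # 4 1 3
```

When one column is mostly empty, as `Insulin` is here, dropping that column or imputing it may preserve far more rows than a blanket `dropna()`.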

### Imputing with Mean or Median

<font size = 4> When to Use: <font size = 3.5> Imputing with the mean or median suits numerical columns: use the mean when the distribution is roughly symmetric with few outliers, and the median when it is skewed or contains outliers.</font></font>

In [71]:
# Impute missing values with the mean of the column
df['BMI'] = df['BMI'].fillna(df['BMI'].mean())

In [73]:
# Impute missing values with the median of the column
df['Glucose'] = df['Glucose'].fillna(df['Glucose'].median())

In [74]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    float64
 2   BloodPressure             733 non-null    float64
 3   SkinThickness             541 non-null    float64
 4   Insulin                   394 non-null    float64
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(6), int64(3)
memory usage: 54.1 KB


### Imputing Missing Values: Mean vs. Median

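**Use Mean When:**
- The variable is continuous and roughly symmetric
- There are few or no outliers

**Use Median When:**
- The variable is ordinal or has a skewed distribution
- Outliers are present

For nominal categorical variables, neither the mean nor the median is defined; use the mode instead.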
---

<kbd style="border: 1px solid #ccc; padding: 10px; display: block;">
Choose the mean for roughly normally distributed data without significant outliers. Choose the median for skewed data, data with outliers, or ordinal variables, because it depends only on the rank order of the values and not on their magnitude. Nominal categorical variables have neither a mean nor a median, so impute them with the mode. In every case, consider the nature of the data, the potential impact on analysis or modeling, and the context of the problem you are trying to solve.
</kbd>


In [75]:
# Example 1: Mean vs. Median for Data with Outliers
salaries = np.array([40000, 45000, 47000, 42000, 500000])
mean_salary = np.mean(salaries)
median_salary = np.median(salaries)

print("Salaries:", salaries)
print("Mean Salary:", mean_salary)
print("Median Salary:", median_salary)

Salaries: [ 40000  45000  47000  42000 500000]
Mean Salary: 134800.0
Median Salary: 45000.0
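
The same effect shows up during imputation. In this sketch, a made-up salary column with one missing entry and one outlier is filled once with the mean and once with the median.

```python
import numpy as np
import pandas as pd

# Hypothetical salaries: one missing value and one large outlier
incomes = pd.Series([40000, 45000, np.nan, 42000, 500000], dtype=float)

filled_mean = incomes.fillna(incomes.mean())      # gap becomes 156750.0
filled_median = incomes.fillna(incomes.median())  # gap becomes 43500.0

print(filled_mean[2], filled_median[2])
```

The mean-filled value is pulled far above every typical salary by the single outlier, while the median-filled value stays in the range of the other observations.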


### Imputing with Mode

<font size = 4> When to Use: <font size = 3.5> Imputing with mode is suitable for categorical columns when the majority of the data belongs to a specific category.

In [76]:
# Sample DataFrame with missing data
data = {
    'ID': [1, 2, 3, 4, 5],
    'Age': [25, None, 22, 28, None],
    'City': ['New York', 'London', None, 'Berlin', 'London']
}

df = pd.DataFrame(data)
df

   ID   Age      City
0   1  25.0  New York
1   2   NaN    London
2   3  22.0      None
3   4  28.0    Berlin
4   5   NaN    London


In [78]:
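# Imputing missing values in the 'City' column with the mode
mode_city = df['City'].mode().iloc[0]
# Assign the result back; calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
df['City'] = df['City'].fillna(mode_city)

# Display the results
print("DataFrame after mode imputation:")
df

DataFrame after mode imputation: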


   ID   Age      City
0   1  25.0  New York
1   2   NaN    London
2   3  22.0    London
3   4  28.0    Berlin
4   5   NaN    London


### Advanced Imputation Techniques

<font size = 4> When to Use: <font size = 3.5> For more complex cases, advanced imputation techniques such as K-nearest neighbors (KNN) imputation, regression imputation, or multiple imputation can be used.</font></font>

<div style="text-align: center; font-size: 18px; color: red; border: 2px solid black; padding: 10px;">
    We will cover this topic in more detail in future sessions
</div>

In [79]:
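from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=3)  # fill each gap from the 3 nearest rows
# Note: df_cleaned has no missing values left, so this call only demonstrates the API;
# pass a numeric DataFrame that still contains NaNs to see values being imputed.
df_missing_knn = pd.DataFrame(imputer.fit_transform(df_cleaned), columns=df_cleaned.columns)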

In [80]:
df_missing_knn.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
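
To see `KNNImputer` actually fill a gap, here is a minimal sketch on a made-up two-column table with one missing `Glucose` value. The data and the choice of `n_neighbors=2` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data: the last row is missing its Glucose value
toy = pd.DataFrame({
    'Glucose': [100.0, 102.0, 98.0, 180.0, np.nan],
    'BMI': [25.0, 26.0, 24.0, 40.0, 25.5],
})

# Distances are computed on the columns both rows have (here: BMI);
# the gap is filled with the mean Glucose of the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)

print(filled)
```

The two rows closest in `BMI` to the incomplete row are the first two, so the imputed `Glucose` is the average of 100 and 102. In practice the features are usually scaled first, because KNN distances are sensitive to column units.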

## Indexing and Slicing

In [81]:
# Sample DataFrame
data = {
    'Name': ['John', 'Alice', 'Bob', 'Eva', 'Michael'],
    'Age': [25, 30, 22, 28, 35],
    'City': ['New York', 'London', 'Paris', 'Berlin', 'Los Angeles']
}

df = pd.DataFrame(data)
df

      Name  Age         City
0     John   25     New York
1    Alice   30       London
2      Bob   22        Paris
3      Eva   28       Berlin
4  Michael   35  Los Angeles


In [82]:
# Using loc to access rows and columns by labels
print("Using loc:")
print(df.loc[2])  # Access row with label 2

Using loc:
Name      Bob
Age        22
City    Paris
Name: 2, dtype: object


In [88]:
print(df.loc[1:3])  # Access rows with labels 1 to 3 (inclusive)

    Name  Age    City
1  Alice   30  London
2    Bob   22   Paris
3    Eva   28  Berlin


In [89]:
print(df.loc[:, 'Name'])  # Access 'Name' column using label

0       John
1      Alice
2        Bob
3        Eva
4    Michael
Name: Name, dtype: object


In [90]:
print(df.loc[:, ['Name', 'Age']])  # Access multiple columns using labels

      Name  Age
0     John   25
1    Alice   30
2      Bob   22
3      Eva   28
4  Michael   35


In [99]:
# Using iloc to access rows and columns by integer positions
print("\nUsing iloc:")
print(df.iloc[2])  # Access row at integer position 2


Using iloc:
Name      Bob
Age        22
City    Paris
Name: 2, dtype: object


In [100]:
print(df.iloc[1:4])  # Access rows at positions 1 to 3 (end position 4 is excluded)

    Name  Age    City
1  Alice   30  London
2    Bob   22   Paris
3    Eva   28  Berlin


In [101]:
print(df.iloc[:, 0])  # Access first column using integer position

0       John
1      Alice
2        Bob
3        Eva
4    Michael
Name: Name, dtype: object


In [102]:
print(df.iloc[:, [0, 1]])  # Access multiple columns using integer positions

      Name  Age
0     John   25
1    Alice   30
2      Bob   22
3      Eva   28
4  Michael   35


In [103]:
print(df.iloc[1:4, 0:3])  # Rows at positions 1 to 3 and columns at positions 0 to 2 (end positions are excluded)

    Name  Age    City
1  Alice   30  London
2    Bob   22   Paris
3    Eva   28  Berlin
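
In the examples above, labels and positions coincide because the index is the default 0, 1, 2, ... The difference between `loc` and `iloc` becomes visible with a non-default index, as in this sketch on a made-up DataFrame:

```python
import pandas as pd

# Hypothetical data with index labels 10, 20, 30 instead of 0, 1, 2
people = pd.DataFrame(
    {'Name': ['John', 'Alice', 'Bob'], 'Age': [25, 30, 22]},
    index=[10, 20, 30],
)

by_label = people.loc[20, 'Name']     # label 20 -> 'Alice'
by_position = people.iloc[0, 0]       # first row, first column -> 'John'

label_slice = people.loc[10:20]       # labels 10 through 20, end included
position_slice = people.iloc[0:2]     # positions 0 and 1, end excluded

print(by_label, by_position)
print(label_slice.equals(position_slice))   # True
```

Here `people.loc[0]` would raise a `KeyError`, because no row carries the label 0.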


## Handling Categorical Data

In [85]:
# Sample DataFrame with categorical data
data = {
    'ID': [1, 2, 3, 4, 5],
    'City': ['New York', 'London', 'Paris', 'London', 'Tokyo'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female']
}

df = pd.DataFrame(data)
df

   ID      City  Gender
0   1  New York    Male
1   2    London  Female
2   3     Paris    Male
3   4    London    Male
4   5     Tokyo  Female


<img src = "https://miro.medium.com/v2/resize:fit:1400/1*O_pTwOZZLYZabRjw3Ga21A.png" style="width: 100%">

In [86]:
# One-hot encoding for the 'City' column
df_encoded1 = pd.get_dummies(df, columns=['City'])

# One-hot encoding for the 'Gender' column
df_encoded2 = pd.get_dummies(df_encoded1, columns=['Gender'])

In [89]:
# Display the results
print("Original DataFrame:")
df

Original DataFrame:


   ID      City  Gender
0   1  New York    Male
1   2    London  Female
2   3     Paris    Male
3   4    London    Male
4   5     Tokyo  Female


In [90]:
print("\nDataFrame after One-Hot Encoding (City only):")
df_encoded1


DataFrame after One-Hot Encoding (City only):


   ID  Gender  City_London  City_New York  City_Paris  City_Tokyo
0   1    Male            0              1           0           0
1   2  Female            1              0           0           0
2   3    Male            0              0           1           0
3   4    Male            1              0           0           0
4   5  Female            0              0           0           1


In [92]:
print("\nDataFrame after One-Hot Encoding (City and Gender):")
df_encoded2


DataFrame after One-Hot Encoding (City and Gender):


   ID  City_London  City_New York  City_Paris  City_Tokyo  Gender_Female  Gender_Male
0   1            0              1           0           0              0            1
1   2            1              0           0           0              1            0
2   3            0              0           1           0              0            1
3   4            1              0           0           0              0            1
4   5            0              0           0           1              1            0
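
For a two-category column such as `Gender`, the two indicator columns carry the same information, since one is always the opposite of the other. A common option is `drop_first=True`, sketched here on a made-up DataFrame; `dtype=int` is passed so the output is 0/1 regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical sample data
people = pd.DataFrame({
    'ID': [1, 2, 3],
    'Gender': ['Male', 'Female', 'Male'],
})

# Drop the first category (Female) and keep a single 0/1 indicator
encoded = pd.get_dummies(people, columns=['Gender'], drop_first=True, dtype=int)

print(encoded)
```

Dropping one level avoids perfectly correlated columns, which matters for linear models; tree-based models are usually unaffected either way.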


## Merging and Joining Data

In [93]:
# Sample DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [25, 30, 22, 28]
}) 

df2 = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'City': ['Paris', 'Berlin', 'Tokyo', 'New York'],
    'Occupation': ['Artist', 'Engineer', 'Doctor', 'Writer']
})

# Merge DataFrames based on common 'ID' column (inner join)
merged_inner = pd.merge(df1, df2, on='ID')

In [94]:
# Display the results
print("DataFrame 1:")
df1

DataFrame 1:


   ID     Name  Age
0   1    Alice   25
1   2      Bob   30
2   3  Charlie   22
3   4    David   28


In [95]:
print("\nDataFrame 2:")
df2


DataFrame 2:


   ID      City  Occupation
0   3     Paris      Artist
1   4    Berlin    Engineer
2   5     Tokyo      Doctor
3   6  New York      Writer


In [96]:
print("\nMerged Inner:")
merged_inner


Merged Inner:


   ID     Name  Age    City  Occupation
0   3  Charlie   22   Paris      Artist
1   4    David   28  Berlin    Engineer
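
The inner join keeps only IDs present in both tables. The `how` parameter of `pd.merge` selects other join types; the sketch below rebuilds two small tables like the ones above (with fewer columns, for brevity) and shows a left and an outer join.

```python
import pandas as pd

left = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
})
right = pd.DataFrame({
    'ID': [3, 4, 5, 6],
    'City': ['Paris', 'Berlin', 'Tokyo', 'New York'],
})

# Left join: keep every row of `left`; unmatched IDs get NaN for City
merged_left = pd.merge(left, right, on='ID', how='left')

# Outer join: keep every ID from both sides; indicator=True adds a
# '_merge' column showing where each row came from
merged_outer = pd.merge(left, right, on='ID', how='outer', indicator=True)

print(merged_left)
print(merged_outer)
```

The `_merge` column takes the values `left_only`, `right_only`, or `both`, which is handy for checking how many keys failed to match.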
