## Data Wrangling I
Perform the following operations using Python on any open source dataset (e.g., data.csv):
1. Import all the required Python Libraries.
2. Locate an open source dataset on the web (e.g. https://www.kaggle.com). Provide a clear
description of the data and its source (i.e., URL of the website).
3. Load the dataset into a pandas data frame.
4. Data Preprocessing: check for missing values in the data using the pandas isnull() function, and use the describe()
function to get some initial statistics. Provide variable descriptions, types of variables,
etc. Check the dimensions of the data frame.
5. Data Formatting and Data Normalization: Summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.
In addition to the code and outputs, explain every operation that you do in the above steps and
explain everything that you do to import/read/scrape the data set.

### 1. Import all required Python Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### 2. Locate an Open Source Dataset from the Web

Source: https://www.kaggle.com/datasets/uciml/iris?resource=download

##### About Dataset
The Iris dataset was used in R.A. Fisher's classic 1936 paper, The Use of Multiple Measurements in Taxonomic Problems, and can also be found on the UCI Machine Learning Repository.

It includes three iris species with 50 samples each as well as some properties about each flower. One flower species is linearly separable from the other two, but the other two are not linearly separable from each other.

The columns in this dataset are: Id, SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species.

### 3. Loading the dataset into Pandas Data Frame

In [None]:
iris = pd.read_csv('Iris.csv')
iris.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
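
The load step above assumes that Iris.csv sits in the working directory. As a minimal sketch of the same `pd.read_csv` call, the snippet below reads two rows copied from the output above out of an in-memory string, so it runs without the file. The `csv_text` and `df` names are illustrative, and using `Id` as the index is an optional choice that is not part of the original notebook.

```python
import io

import pandas as pd

# Two rows copied from the head() output above, held in memory
# so the sketch does not depend on Iris.csv being present.
csv_text = """Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
"""

# index_col="Id" uses the Id column as the row index instead of
# keeping it as a regular column (assumption: Id is only a row label).
df = pd.read_csv(io.StringIO(csv_text), index_col="Id")
print(df)
```

With `Id` as the index, the frame has five data columns instead of six, and `describe()` would no longer report statistics for the identifier.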


### 4. Data Preprocessing

In [None]:
iris.isnull()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
145,False,False,False,False,False,False
146,False,False,False,False,False,False
147,False,False,False,False,False,False
148,False,False,False,False,False,False


In [None]:
iris.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [None]:
iris.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object
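
The dtypes above are already sensible for the numeric columns, so no conversion is strictly needed here. As a hedged sketch of what a conversion could look like when a column arrives in the wrong type, the snippet below uses a small hypothetical frame (`sample`) in which a measurement was read as text, and converts it with `pd.to_numeric` and `astype`. The frame and its values are for illustration only.

```python
import pandas as pd

# Hypothetical frame: the measurements were read as text,
# and Species is a plain string column.
sample = pd.DataFrame({
    "SepalLengthCm": ["5.1", "4.9", "4.7"],
    "Species": ["Iris-setosa", "Iris-setosa", "Iris-setosa"],
})

# Text -> numeric: to_numeric parses the strings into floats.
sample["SepalLengthCm"] = pd.to_numeric(sample["SepalLengthCm"])

# String labels -> category: states that Species is a categorical variable.
sample["Species"] = sample["Species"].astype("category")

print(sample.dtypes)
```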

In [None]:
iris.shape

(150, 6)

In [None]:
iris.isnull().sum()

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64
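
The Iris data has no missing values, so the counts above are all zero. To show what the same checks report when values are missing, here is a minimal sketch on a made-up frame (`toy`) with two gaps. The frame is hypothetical and is not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing number and one missing label.
toy = pd.DataFrame({
    "SepalLengthCm": [5.1, np.nan, 4.7],
    "Species": ["Iris-setosa", "Iris-setosa", None],
})

# Missing values per column (same call as used on iris above).
missing_per_column = toy.isnull().sum()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())

# Rows that contain at least one missing value.
rows_with_missing = toy[toy.isnull().any(axis=1)]

print(missing_per_column)
print(has_missing)
print(rows_with_missing)
```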

### 5. Data Formatting and Data Normalization

##### Normalization: changing the scale of the data (feature scaling)
1. Min Max Normalization/ Linear Scaling
2. Z-Score Normalization
3. Decimal Scaling
4. Power Transformation (Square-root, cube-root)
5. Log Transformation
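
The five techniques listed above can be sketched on a single column as follows. This is an illustrative example only: `x` holds the five sepal lengths from the first rows of the dataset, so the scaled values differ from those obtained on the full 150 rows, and all variable names are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

# Sepal lengths of the first five rows (from the head() output).
x = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0])

# 1. Min-max normalization: rescales to the range [0, 1].
min_max = (x - x.min()) / (x.max() - x.min())

# 2. Z-score normalization: mean 0 and (sample) standard deviation 1.
z_score = (x - x.mean()) / x.std()

# 3. Decimal scaling: divide by 10**j, where j is the smallest
#    integer that brings the largest absolute value below 1.
j = int(np.floor(np.log10(x.abs().max()))) + 1
decimal_scaled = x / 10 ** j

# 4. Power transformation (square root shown here).
sqrt_transformed = np.sqrt(x)

# 5. Log transformation (natural log; requires positive values).
log_transformed = np.log(x)

print(pd.DataFrame({
    "min_max": min_max,
    "z_score": z_score,
    "decimal": decimal_scaled,
    "sqrt": sqrt_transformed,
    "log": log_transformed,
}))
```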

In [None]:
# Using Min Max Normalization

def min_max_normalize(name: str) -> None:
    """Rescale column `name` of the global `iris` data frame to [0, 1], in place."""
    col = iris[name]
    iris[name] = (col - col.min()) / (col.max() - col.min())

In [None]:
min_max_normalize("SepalLengthCm")
min_max_normalize("SepalWidthCm")
min_max_normalize("PetalLengthCm")
min_max_normalize("PetalWidthCm")

In [None]:
iris.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,0.222222,0.625,0.067797,0.041667,Iris-setosa
1,2,0.166667,0.416667,0.067797,0.041667,Iris-setosa
2,3,0.111111,0.5,0.050847,0.041667,Iris-setosa
3,4,0.083333,0.458333,0.084746,0.041667,Iris-setosa
4,5,0.194444,0.666667,0.067797,0.041667,Iris-setosa


### 6. Turn categorical variables into quantitative variables in Python.

In [None]:
iris['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

In [None]:
# Map each species label to an integer code (the codes are arbitrary labels, not a ranking)
num_code = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
iris["Species"] = iris["Species"].map(num_code)

In [None]:
iris

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,0.222222,0.625000,0.067797,0.041667,1
1,2,0.166667,0.416667,0.067797,0.041667,1
2,3,0.111111,0.500000,0.050847,0.041667,1
3,4,0.083333,0.458333,0.084746,0.041667,1
4,5,0.194444,0.666667,0.067797,0.041667,1
...,...,...,...,...,...,...
145,146,0.666667,0.416667,0.711864,0.916667,3
146,147,0.555556,0.208333,0.677966,0.750000,3
147,148,0.611111,0.416667,0.711864,0.791667,3
148,149,0.527778,0.583333,0.745763,0.916667,3
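
Integer codes such as 1, 2, 3 suggest an order between the species that does not really exist. One-hot encoding is a common alternative that avoids this. The sketch below shows both encodings side by side on a short, made-up `species` series; the series and the variable names are for illustration only and are not the notebook's data.

```python
import pandas as pd

# Short made-up series with all three labels.
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Species",
)

# Integer (label) encoding with an explicit mapping, as used above.
codes_map = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}
encoded = species.map(codes_map)

# One-hot encoding: one 0/1 column per species, no implied order.
one_hot = pd.get_dummies(species, prefix="Species", dtype=int)

print(encoded.tolist())
print(one_hot)
```

Label encoding keeps a single column, while one-hot encoding adds one column per category; which is preferable depends on the model that will consume the data.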
