Perform the following operations using Python on any open-source dataset (e.g., data.csv)

 1. Import all the required Python Libraries.
Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
 2. Load the dataset into a pandas DataFrame. Data Preprocessing: check for missing values using the pandas isnull() function, and use the describe() function to get some initial statistics. Provide variable descriptions and variable types. Check the dimensions of the data frame.
 3. Data Formatting and Data Normalization: Summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the data set. If variables are not in the correct data type, apply proper type conversions.
 4. Turn categorical variables into quantitative variables in Python.

In addition to the code and outputs, explain every operation you perform in the above steps, including everything you do to import/read/scrape the dataset.

In [2]:
import numpy as np
import pandas as pd    # these are the only libraries required for this assignment

In [63]:
df = pd.read_csv("./iris.csv")

In [4]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [6]:
df.describe()          # describe() generates descriptive statistics for the numeric columns of the data frame

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [13]:
df.shape             # Gives the dimensions of the data frame


(150, 6)

In [16]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


<b>Evaluating for Missing Data</b>

When the data is loaded, pandas represents missing values with its default marker, NaN. We use pandas' built-in methods to identify these missing values. There are two methods to detect missing data:
<ol>
    <li>isnull()</li>
    <li>notnull()<br></li>
    </ol>
  The output is a boolean value for each entry indicating whether that entry is missing. With isnull(), "True" means the value is missing, while "False" means it is not; notnull() returns the opposite.<br>
  Ways to deal with missing data:<br>
<ol>
    <li>Drop data<br>
  a. Drop the whole row<br>
  b. Drop the whole column<br></li>
    <li>Replace data<br>
  a. Replace it by mean<br>
  b. Replace it by frequency / mode<br>
  c. Replace it based on other functions<br></li>
    </ol>
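
The detection methods and the drop/replace options listed above can be tried on a small example. This is only a sketch: the `toy` data frame and its column names are made up for illustration and are not part of the Iris data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few deliberately missing entries
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "length": [5.1, np.nan, 4.7, 4.6],
    "width": [3.5, 3.0, np.nan, 3.1],
    "species": ["setosa", "setosa", None, "virginica"],
})

# Detection
missing_mask = toy.isnull()              # True where a value is missing
present_mask = toy.notnull()             # exact complement of isnull()
missing_per_column = toy.isnull().sum()  # number of missing values per column

# Option 1: drop data
rows_dropped = toy.dropna(axis=0)        # drop every row that contains a missing value
cols_dropped = toy.dropna(axis=1)        # drop every column that contains a missing value

# Option 2: replace data
filled = toy.copy()
filled["length"] = filled["length"].fillna(filled["length"].mean())        # replace by mean
filled["species"] = filled["species"].fillna(filled["species"].mode()[0])  # replace by most frequent value

print(missing_per_column)
print(rows_dropped)
print(filled)
```

Replacing by the mean suits numeric columns, while the mode is the usual choice for categorical columns such as a species label.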

In [5]:
df.isnull()           #Checking for null values in the pandas' dataframe

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
...,...,...,...,...,...,...
145,False,False,False,False,False,False
146,False,False,False,False,False,False
147,False,False,False,False,False,False
148,False,False,False,False,False,False


In [20]:
df.isnull().sum()     # for calculating total number of null values in all columns

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

In [35]:
# If the data set above had contained missing values, either of the two
# methods below could be used to get rid of them.

avg_sepalLength = df['SepalLengthCm'].mean()        # mean() skips NaN values by default

# Replace NaN values with the mean of SepalLengthCm.
# The result is assigned back to the column so that the original data frame is updated;
# without inplace=True, replace() returns a new object and leaves the original unchanged.
# (Calling replace(..., inplace=True) on a selected column is unreliable in recent pandas versions.)
df['SepalLengthCm'] = df['SepalLengthCm'].replace(np.nan, avg_sepalLength)
avg_sepalLength

5.843333333333334

In [40]:
df.dropna(subset=['PetalLengthCm'], axis=0, inplace=True)
# axis=0 means that rows (not columns) containing missing values are dropped;
# subset=['column_name'] lists the columns to be checked for NaN values.

In [43]:
df.reset_index(drop=True, inplace=True)
# With drop=True the old index is discarded; otherwise it is stored in a new column.
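
Since the Iris data has no missing values, the two cells above leave it unchanged. A minimal sketch on a made-up frame (the name `toy` and its columns are hypothetical) shows what `subset` and `drop` actually do:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one NaN in each column
toy = pd.DataFrame({
    "petal": [1.4, np.nan, 1.3, 1.5],
    "sepal": [5.1, 4.9, np.nan, 4.6],
})

# Only rows where 'petal' is missing are dropped; the NaN in 'sepal' is kept
kept = toy.dropna(subset=["petal"], axis=0)

renumbered = kept.reset_index(drop=True)   # old index labels are discarded
with_old = kept.reset_index(drop=False)    # old index labels are saved in an 'index' column

print(kept)
print(renumbered)
print(with_old)
```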

In [44]:
df.isnull().sum() #check if our null values are now gone

Id               0
SepalLengthCm    0
SepalWidthCm     0
PetalLengthCm    0
PetalWidthCm     0
Species          0
dtype: int64

# Part II
# Data Standardization

In [45]:
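# Simple feature scaling: divide each value by the column maximum so that the values fall in the range (0, 1]
df['SepalWidthCm'] = df['SepalWidthCm']/df['SepalWidthCm'].max()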

In [46]:
df['SepalWidthCm']

0      0.795455
1      0.681818
2      0.727273
3      0.704545
4      0.818182
         ...   
145    0.681818
146    0.568182
147    0.681818
148    0.772727
149    0.681818
Name: SepalWidthCm, Length: 150, dtype: float64
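
Dividing by the maximum is one of several common ways to normalize a column. The sketch below compares it with min-max scaling and z-score scaling on a few hand-picked values; the `widths` series is illustrative and is not the full Iris column.

```python
import pandas as pd

# A few illustrative sepal widths (the maximum, 4.4, matches the Iris column maximum)
widths = pd.Series([3.5, 3.0, 3.2, 2.0, 4.4], name="SepalWidthCm")

# Simple feature scaling: x / max, the method used in the cell above
simple_scaled = widths / widths.max()

# Min-max scaling: (x - min) / (max - min), maps the column onto [0, 1]
min_max_scaled = (widths - widths.min()) / (widths.max() - widths.min())

# Z-score scaling: (x - mean) / std, gives mean 0 and standard deviation 1
z_scores = (widths - widths.mean()) / widths.std()

print(simple_scaled)
print(min_max_scaled)
print(z_scores)
```

The first value, 3.5 / 4.4, is about 0.795455, which agrees with row 0 of the output above.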

In [47]:
df['PetalWidthCm'] = df['PetalWidthCm'].astype("int")   # casting float to int truncates the fractional part (e.g., 0.2 becomes 0)
df['PetalWidthCm']

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    1
147    2
148    2
149    1
Name: PetalWidthCm, Length: 150, dtype: int32
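
Note that this cast discards information: every petal width below 1 becomes 0. A small sketch, using made-up values, contrasts plain truncation with rounding before the cast. Asking for `"int64"` explicitly is an assumption on my part to keep the result the same on every platform, since a plain `"int"` may resolve to a platform-dependent width.

```python
import pandas as pd

# Illustrative petal widths, not the full Iris column
widths = pd.Series([0.2, 1.3, 1.8, 2.3], name="PetalWidthCm")

truncated = widths.astype("int64")          # fractional part is dropped
rounded = widths.round().astype("int64")    # round to the nearest integer first, then cast

print(truncated.tolist())
print(rounded.tolist())
```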

In [48]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,0.795455,1.4,0,Iris-setosa
1,2,4.9,0.681818,1.4,0,Iris-setosa
2,3,4.7,0.727273,1.3,0,Iris-setosa
3,4,4.6,0.704545,1.5,0,Iris-setosa
4,5,5.0,0.818182,1.4,0,Iris-setosa


In [51]:
df.tail()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
145,146,6.7,0.681818,5.2,2.0,Iris-virginica
146,147,6.3,0.568182,5.0,1.0,Iris-virginica
147,148,6.5,0.681818,5.2,2.0,Iris-virginica
148,149,6.2,0.772727,5.4,2.0,Iris-virginica
149,150,5.9,0.681818,5.1,1.0,Iris-virginica


In [53]:
df['PetalWidthCm'] = df['PetalWidthCm'].astype("int")
df['PetalWidthCm']

0      0
1      0
2      0
3      0
4      0
      ..
145    2
146    1
147    2
148    2
149    1
Name: PetalWidthCm, Length: 150, dtype: int64

In [66]:
# First, convert the Species column from categorical to numeric.
cleanUp_series = {"Species":{'Iris-virginica':1, 'Iris-setosa':2 , 'Iris-versicolor' :3}}

df.replace(cleanUp_series,inplace= True)
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,2
1,2,4.9,3.0,1.4,0.2,2
2,3,4.7,3.2,1.3,0.2,2
3,4,4.6,3.1,1.5,0.2,2
4,5,5.0,3.6,1.4,0.2,2
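
A hand-written mapping like the one above is one way to encode a categorical column. The sketch below shows it next to two other common options, `pd.factorize` and `pd.get_dummies`, on a short made-up series. The integer codes carry no real ordering, so one-hot encoding is often the safer choice when the numbers are used as model inputs.

```python
import pandas as pd

# Short illustrative series, not the full Species column
species = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"],
    name="Species",
)

# 1. Explicit mapping (same codes as in the cell above)
mapping = {"Iris-virginica": 1, "Iris-setosa": 2, "Iris-versicolor": 3}
label_encoded = species.map(mapping)

# 2. Automatic integer codes, numbered in order of first appearance
codes, uniques = pd.factorize(species)

# 3. One-hot encoding: one 0/1 indicator column per category
one_hot = pd.get_dummies(species, dtype=int)

print(label_encoded.tolist())
print(codes.tolist(), list(uniques))
print(one_hot)
```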


In [72]:
df['Species'] = pd.to_numeric(df.Species, errors='coerce')
# Use this command to convert an object column to a numeric dtype.
# It is independent of the replace() conversion above.
# With errors='coerce', values that cannot be parsed as numbers are set to NaN
# instead of raising an error.
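
The effect of `errors='coerce'` is easiest to see on a column that contains non-numeric text. The `raw` series below is made up for illustration:

```python
import pandas as pd

# Hypothetical object column mixing numeric strings, text, and a missing value
raw = pd.Series(["1", "2.5", "three", None])

# errors='coerce': unparseable entries become NaN
converted = pd.to_numeric(raw, errors="coerce")

# The default, errors='raise', stops at the first unparseable entry instead
try:
    pd.to_numeric(raw)
    raised = False
except ValueError:
    raised = True

print(converted)
print(raised)
```

After coercion it is worth running `isnull().sum()` again, because the conversion can introduce new missing values.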

In [69]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species            int64
dtype: object

In [70]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,2
1,2,4.9,3.0,1.4,0.2,2
2,3,4.7,3.2,1.3,0.2,2
3,4,4.6,3.1,1.5,0.2,2
4,5,5.0,3.6,1.4,0.2,2
