# Exploratory Data Analysis

## Data Wrangling

- Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze. As the volume of data and the number of data sources grow rapidly, storing and organizing large quantities of data for analysis is becoming increasingly necessary.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('student.csv')
df

Unnamed: 0,age,height
0,20,160
1,21,161
2,22,162
3,23,163
4,24,164
5,25,165
6,26,166
7,27,167
8,28,168
9,29,169


### Info

- Number of rows and columns
- Non-null count per column, which reveals missing values
- Data type of each column
- Memory consumption
- Total number of entries
- Overall structure of the DataFrame
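
The items that `info()` summarizes can also be retrieved one at a time. The sketch below is only illustrative: `demo` is a hypothetical DataFrame built in memory as a stand-in for `student.csv`, and the variable names are not from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for student.csv, built in memory
demo = pd.DataFrame({"age": np.arange(20, 31), "height": np.arange(160, 171)})

n_rows, n_cols = demo.shape               # number of rows and columns
null_counts = demo.isnull().sum()         # missing values per column
column_types = demo.dtypes                # data type of each column
memory_bytes = demo.memory_usage().sum()  # bytes used by the index and the columns

print(n_rows, n_cols)
print(null_counts)
print(column_types)
print(memory_bytes)
```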

In [3]:
df.info()  # Basic information about a dataset (rows and columns); this is the first step of exploratory data analysis

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     11 non-null     int64
 1   height  11 non-null     int64
dtypes: int64(2)
memory usage: 304.0 bytes


### Describe

- Provides descriptive statistics / insights

In [4]:
df.describe()    

Unnamed: 0,age,height
count,11.0,11.0
mean,25.0,165.0
std,3.316625,3.316625
min,20.0,160.0
25%,22.5,162.5
50%,25.0,165.0
75%,27.5,167.5
max,30.0,170.0


In [5]:
df.describe().T  # Transpose

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,11.0,25.0,3.316625,20.0,22.5,25.0,27.5,30.0
height,11.0,165.0,3.316625,160.0,162.5,165.0,167.5,170.0


- Q1 : 25%
- Q2 : 50%
- Q3 : 75%
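- Standard Deviation : The same for both columns, because both were created from array ranges of the same shape and differ only by a constant offset (height = age + 140). Adding a constant shifts the mean but does not change the spread.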
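
Why the two standard deviations coincide can be checked directly. The sketch below assumes, as the `describe()` output suggests, that `age` runs from 20 to 30 and that `height` equals `age + 140`. The values are rebuilt with `np.arange` rather than read from `student.csv`.

```python
import numpy as np
import pandas as pd

age = pd.Series(np.arange(20, 31))   # 20, 21, ..., 30 (assumed from the describe() output)
height = age + 140                   # 160, 161, ..., 170

# Adding a constant moves the mean but leaves the spread untouched
print(age.mean(), height.mean())
print(age.std(), height.std())       # pandas reports the sample std (ddof=1)
```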

In [6]:
df.isnull() # Returns False everywhere because there are no missing values

Unnamed: 0,age,height
0,False,False
1,False,False
2,False,False
3,False,False
4,False,False
5,False,False
6,False,False
7,False,False
8,False,False
9,False,False


In [7]:
df2 = pd.read_csv('product.csv')
df2

Unnamed: 0,Product,Sales
0,Cable,500.0
1,Lights,210.0
2,Fan,
3,Switch,255.0
4,CCT Breker,
5,Tape,350.0
6,Super,400.0
7,Meter,410.0


## Missing Values / Null Value Treatment

- Handling missing values is very important for machine learning.
- Missing values may appear in different forms, e.g. N/A, ?, NaN, etc. All of these forms need to be converted to "np.nan" for consistent treatment.
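
The conversion of placeholder markers to `np.nan` can be sketched in two ways. The CSV text, the placeholder list, and the variable names below are hypothetical and are not the contents of `product.csv`.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical CSV text in which missing sales appear as "N/A" and "?"
raw = "Product,Sales\nCable,500\nFan,N/A\nSwitch,?\nTape,350\n"

# Option 1: read the markers as literal text, then replace them.
# keep_default_na=False stops read_csv from converting "N/A" by itself.
as_text = pd.read_csv(io.StringIO(raw), keep_default_na=False)
cleaned = as_text.replace(["N/A", "?"], np.nan)

# Option 2: declare the placeholders while reading
parsed = pd.read_csv(io.StringIO(raw), na_values=["N/A", "?"])

print(cleaned.isnull().sum())
print(parsed.isnull().sum())
```

With option 1 the `Sales` column still holds text after the replacement, so it needs a numeric conversion afterwards. Option 2 yields a numeric column directly.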

In [8]:
df2.isnull()  # Returns True wherever a value is missing / null

Unnamed: 0,Product,Sales
0,False,False
1,False,False
2,False,True
3,False,False
4,False,True
5,False,False
6,False,False
7,False,False


In [9]:
df2.replace('None', np.nan) # Returns a new DataFrame; df2 itself is not updated without inplace=True

Unnamed: 0,Product,Sales
0,Cable,500.0
1,Lights,210.0
2,Fan,
3,Switch,255.0
4,CCT Breker,
5,Tape,350.0
6,Super,400.0
7,Meter,410.0


In [10]:
df2.isnull() # Verified: df2 was not updated

Unnamed: 0,Product,Sales
0,False,False
1,False,False
2,False,True
3,False,False
4,False,True
5,False,False
6,False,False
7,False,False


In [11]:
df2.replace('None', np.nan, inplace=True)

In [12]:
df2.isnull()

Unnamed: 0,Product,Sales
0,False,False
1,False,False
2,False,True
3,False,False
4,False,True
5,False,False
6,False,False
7,False,False


In [13]:
df2.isnull().sum()   # Count of missing values per column; sum() counts each True as 1

Product    0
Sales      2
dtype: int64

In [14]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Product  8 non-null      object 
 1   Sales    6 non-null      float64
dtypes: float64(1), object(1)
memory usage: 256.0+ bytes


## Data Type Conversion

- Required when a numerical column is shown as "object" type
- Typically, a numeric column that contains placeholder text (e.g. "?") is read as text/str. After the placeholders are replaced with np.nan, the column still needs to be converted back to a numerical data type.
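
A minimal sketch of that conversion, assuming a hypothetical `Sales` column in which "?" marks a missing value (the frame `sales` is invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical column stored as text because of the "?" placeholder
sales = pd.DataFrame({"Sales": pd.Series(["500", "?", "255", "350"], dtype=object)})
print(sales["Sales"].dtype)   # object

# Replace the placeholder, then convert the column to a numeric dtype
sales["Sales"] = sales["Sales"].replace("?", np.nan)
sales["Sales"] = pd.to_numeric(sales["Sales"])
print(sales["Sales"].dtype)   # float64, because NaN is a float
```

`pd.to_numeric` also accepts `errors="coerce"`, which turns any value it cannot parse into NaN in a single step.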

In [15]:
df3 = pd.read_csv('student2.csv')
df3

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,
4,,164.0
5,25.0,165.0
6,26.0,
7,27.0,167.0
8,,168.0
9,29.0,169.0


In [16]:
df3.isnull()

Unnamed: 0,age,height
0,False,False
1,False,False
2,False,False
3,False,True
4,True,False
5,False,False
6,False,True
7,False,False
8,True,False
9,False,False


In [17]:
df3.replace('None', np.nan, inplace=True)

In [18]:
df3.isnull()

Unnamed: 0,age,height
0,False,False
1,False,False
2,False,False
3,False,True
4,True,False
5,False,False
6,False,True
7,False,False
8,True,False
9,False,False


In [19]:
df3.isnull().sum()

age       2
height    2
dtype: int64

In [20]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     9 non-null      float64
 1   height  9 non-null      float64
dtypes: float64(2)
memory usage: 304.0 bytes


In [22]:
df3.astype(np.float16)

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,
4,,164.0
5,25.0,165.0
6,26.0,
7,27.0,167.0
8,,168.0
9,29.0,169.0


In [23]:
new_df = df3.copy() # Deep copy

In [24]:
df3

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,
4,,164.0
5,25.0,165.0
6,26.0,
7,27.0,167.0
8,,168.0
9,29.0,169.0


## Missing Value Treatment

In [29]:
df3.isnull()

Unnamed: 0,age,height
0,False,False
1,False,False
2,False,False
3,False,True
4,True,False
5,False,False
6,False,True
7,False,False
8,True,False
9,False,False


In [30]:
df3.isnull().sum()

age       2
height    2
dtype: int64

- .dropna()
- .fillna()

- NaN can never be converted directly to int; the missing values must be filled or dropped first
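
The restriction on converting NaN to int can be demonstrated on a small, hypothetical Series (`ages` is not one of the notebook's frames):

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, np.nan, 22.0])

# Direct conversion fails while NaN is present
conversion_failed = False
try:
    ages.astype(int)
except ValueError as exc:
    conversion_failed = True
    print("conversion failed:", exc)

# After filling or dropping the missing value, the conversion succeeds
filled = ages.fillna(0).astype(int)
dropped = ages.dropna().astype(int)
print(filled.tolist())    # [20, 0, 22]
print(dropped.tolist())   # [20, 22]
```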

In [25]:
df3.dropna()    # axis=0 drops rows containing NaN (the default); axis=1 drops columns containing NaN

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
5,25.0,165.0
7,27.0,167.0
9,29.0,169.0
10,30.0,170.0
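
The effect of the `axis` argument is easiest to see on a frame where only one column has a gap. The frame `people` below is a hypothetical example, not `df3`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one NaN in "age", none in "height"
people = pd.DataFrame({"age": [20.0, np.nan, 22.0], "height": [160.0, 161.0, 162.0]})

rows_dropped = people.dropna(axis=0)   # default: remove rows that contain NaN
cols_dropped = people.dropna(axis=1)   # remove columns that contain NaN

print(rows_dropped.shape)   # (2, 2)
print(cols_dropped.shape)   # (3, 1)
print(people.shape)         # (3, 2): the original frame is unchanged
```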


In [26]:
df3

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,
4,,164.0
5,25.0,165.0
6,26.0,
7,27.0,167.0
8,,168.0
9,29.0,169.0


In [32]:
# Filling the missing values with a scalar, e.g. 50
df3.fillna(50)

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,50.0
4,50.0,164.0
5,25.0,165.0
6,26.0,50.0
7,27.0,167.0
8,50.0,168.0
9,29.0,169.0


In [34]:
# Filling missing values with a dict, i.e. specifying the fill value for each column explicitly
df3.fillna({'age':99, 'height':69})

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,69.0
4,99.0,164.0
5,25.0,165.0
6,26.0,69.0
7,27.0,167.0
8,99.0,168.0
9,29.0,169.0


In [37]:
df3

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,
4,,164.0
5,25.0,165.0
6,26.0,
7,27.0,167.0
8,,168.0
9,29.0,169.0


In [38]:
df3.fillna(method='ffill')  # 'ffill' means forward fill: each NaN is filled with the previous value

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,162.0
4,23.0,164.0
5,25.0,165.0
6,26.0,165.0
7,27.0,167.0
8,27.0,168.0
9,29.0,169.0


In [40]:
df3.fillna(method='bfill') # 'bfill' means backward fill: each NaN is filled with the next value

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,164.0
4,25.0,164.0
5,25.0,165.0
6,26.0,167.0
7,27.0,167.0
8,29.0,168.0
9,29.0,169.0
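
Recent pandas releases (2.1 and later) deprecate the `method=` argument of `fillna()` in favour of the dedicated `ffill()` and `bfill()` methods. A minimal sketch of the same two fills on a hypothetical Series, `heights`:

```python
import numpy as np
import pandas as pd

heights = pd.Series([162.0, np.nan, 164.0, np.nan, 167.0])

forward = heights.ffill()    # each NaN takes the previous valid value
backward = heights.bfill()   # each NaN takes the next valid value

print(forward.tolist())      # [162.0, 162.0, 164.0, 164.0, 167.0]
print(backward.tolist())     # [162.0, 164.0, 164.0, 167.0, 167.0]
```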


In [42]:
df3.fillna(method='bfill', inplace=True)
df3

Unnamed: 0,age,height
0,20.0,160.0
1,21.0,161.0
2,22.0,162.0
3,23.0,164.0
4,25.0,164.0
5,25.0,165.0
6,26.0,167.0
7,27.0,167.0
8,29.0,168.0
9,29.0,169.0


In [43]:
df3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     11 non-null     float64
 1   height  11 non-null     float64
dtypes: float64(2)
memory usage: 304.0 bytes


## DataFrame Type Conversion

In [44]:
df3.astype(int)

Unnamed: 0,age,height
0,20,160
1,21,161
2,22,162
3,23,164
4,25,164
5,25,165
6,26,167
7,27,167
8,29,168
9,29,169


In [45]:
df3.astype(int).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     11 non-null     int32
 1   height  11 non-null     int32
dtypes: int32(2)
memory usage: 216.0 bytes


In [47]:
df4 = df3.astype(np.int64)
df4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11 entries, 0 to 10
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   age     11 non-null     int64
 1   height  11 non-null     int64
dtypes: int64(2)
memory usage: 304.0 bytes


In [51]:
# pd.to_numeric(df2)  # Not valid: to_numeric works on one column (a Series) at a time, not on a whole DataFrame

In [52]:
df2

Unnamed: 0,Product,Sales
0,Cable,500.0
1,Lights,210.0
2,Fan,
3,Switch,255.0
4,CCT Breker,
5,Tape,350.0
6,Super,400.0
7,Meter,410.0


In [59]:
pd.to_numeric(df2['Sales'])  # column selected with bracket access
pd.to_numeric(df2.Sales)  # or with attribute access

0    500.0
1    210.0
2      NaN
3    255.0
4      NaN
5    350.0
6    400.0
7    410.0
Name: Sales, dtype: float64
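
Because `pd.to_numeric` takes one column at a time, converting several columns means calling it once per column. One way to do that is `DataFrame.apply`, sketched here on a hypothetical frame of text columns (`text_df` is invented for illustration):

```python
import pandas as pd

# Hypothetical frame whose numeric columns were read as text
text_df = pd.DataFrame({"age": ["20", "21"], "height": ["160", "161"]}, dtype=object)

# apply() calls pd.to_numeric once per column
numeric_df = text_df.apply(pd.to_numeric)
print(numeric_df.dtypes)
```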