# Data preparation

Below are the steps involved to understand, clean and prepare your data for building your predictive model:

   1.  Variable Identification
   2.  Univariate Analysis
   3.  Bi-variate Analysis
   4.  Missing values treatment
   5.  Outlier treatment
   6.  Variable transformation
   7.  Variable creation
   
Finally, we will need to iterate over steps 4 – 7 multiple times before we come up with our refined model.

In [None]:
import numpy as np
import pandas as pd

# Plotly libraries
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
from plotly.colors import n_colors
from plotly.subplots import make_subplots
# Minmax scaler
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt

In [None]:
# covid=pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')

covid_line=pd.read_csv('../input/novel-corona-virus-2019-dataset/COVID19_line_list_data.csv')
# titanic=pd.read_csv('../input/titanic/train.csv')
# us_counties=pd.read_csv('../input/us-counties-covid-19-dataset/us-counties.csv')

house=pd.read_csv("../input/house-prices-advanced-regression-techniques/train.csv")
# netflix=pd.read_csv("../input/netflix-shows/netflix_titles.csv")

world=pd.read_csv('../input/world-university-rankings/cwurData.csv')
# google=pd.read_csv("../input/google-play-store-apps/googleplaystore.csv")
# campus=pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')

# Variable Identification 

Identify the predictor (input) and target (output) variables.

For example, for the given dataset, the problem statements could be:

1. Given the information of the patient what is 

In [None]:
covid_line.info()

In [None]:
world.info()

 Separate the variables by data type into categorical and continuous ones.

In [None]:
a = world.dtypes[world.dtypes == 'object'].index
world[a].head()
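
The same split can be made for both groups at once with `DataFrame.select_dtypes`. The sketch below is only an illustration: the toy frame `df` and its values are made up here to stand in for `world`, so that the snippet runs without the Kaggle input files.

```python
import pandas as pd

# Toy frame standing in for `world`; the values are invented for illustration.
df = pd.DataFrame({
    "institution": ["A", "B", "C"],
    "country": ["Germany", "USA", "Japan"],
    "score": [71.2, 88.5, 64.0],
    "national_rank": [1, 2, 3],
})

# Categorical (text-like) columns: everything that is neither numeric nor boolean.
categorical_cols = df.select_dtypes(exclude=["number", "bool"]).columns.tolist()

# Continuous / numeric columns.
numeric_cols = df.select_dtypes(include="number").columns.tolist()

print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

Integer columns such as ranks or years land in the numeric group here, so it is still worth checking by hand whether each one is really continuous.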

# Univariate analysis 

 For continuous variables, understand the central tendency and spread of the data by using:
    
    1. Mean, median and mode
    2. IQR
    3. Min
    4. Max
    5. Variance
    6. Box plot 
    7. Histograms
    

In [None]:
# summary statistics (count, mean, std, min, quartiles, max) of the numeric columns

world.describe()
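
`describe()` does not report the mode, IQR or variance directly. As a minimal sketch, the measures listed above can be computed one by one on a single column; the `scores` series below is made-up data, not taken from `world`.

```python
import pandas as pd

# Made-up scores with one large value on the right to show its effect on the mean.
scores = pd.Series([44.0, 45.5, 46.0, 46.0, 47.5, 49.0, 52.0, 60.0, 85.0], name="score")

mean = scores.mean()
median = scores.median()
mode = scores.mode().iloc[0]          # mode() returns a Series; take the first mode
q1, q3 = scores.quantile([0.25, 0.75])
iqr = q3 - q1                         # interquartile range
variance = scores.var()               # sample variance (ddof=1 by default)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"min={scores.min()} max={scores.max()} IQR={iqr} variance={variance:.2f}")
```

A mean noticeably above the median, as in this toy series, hints at a right-skewed distribution, which the box plot and histogram below make visible.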



Purpose : Display the distribution of a continuous variable.

Question : How are the scores spread for different universities in Germany?


In [None]:
germany_score=world[world['country']=="Germany"]['score']
fig = go.Figure(go.Box(y=germany_score,name="Germany Score")) # for a horizontal plot, pass x=germany_score instead of y
fig.update_layout(title="Distribution of Germany University Scores")
fig.show()



Purpose : Display the distribution of a continuous variable.

Question : What is the salary distribution of Science & Technology graduates in a normalized manner?


In [None]:
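# The campus placement dataset is commented out in the loading cell, so read it here
campus=pd.read_csv('../input/factors-affecting-campus-placement/Placement_Data_Full_Class.csv')
campus_science=campus[campus['degree_t']=='Sci&Tech']['salary']
# histnorm='probability' shows the share of graduates per bin instead of raw counts
fig = go.Figure(data=[go.Histogram(x=campus_science,histnorm='probability',
                                  marker_color="orange")])

# For a horizontal plot, change the axis: y=campus_science
fig.update_layout(title="Salary Distribution of Science & Technology Graduates",xaxis_title="Salary",yaxis_title="Probability")
fig.show()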

For categorical variables:

    1. Bar chart
    2. Bubble plot


Bar chart:

Purpose : Displays a quantitative representation of a variable.

Question : How many universities in each country have a good score? (Filtered for universities with a score greater than 60)


In [None]:
# rename_axis + reset_index(name=...) yields 'country' and 'count' columns on both old and new pandas versions
top_countries=world[world['score']>60]['country'].value_counts().rename_axis('country').reset_index(name='count')

fig = go.Figure(go.Bar(
    x=top_countries['country'],y=top_countries['count'],
))
fig.update_layout(title_text='Number of Universities with Score Greater than 60, by Country',xaxis_title="Country",yaxis_title="Number of Universities")
fig.show()

Bubble Plot:

Purpose : Displays a quantitative representation, highlighting the most frequent category through the size of the bubble.

Question : How many universities in each country have a good score? (Filtered for universities with a score greater than 60)

In [None]:
fig = go.Figure(data=[go.Scatter(
    x=top_countries['country'],y=top_countries['count'],
    mode='markers',
    marker=dict(
        color = top_countries['count'],  # color scale shows the actual counts
        showscale = True,
        size=top_countries['count']*2))])  # bubble size is scaled up for visibility

fig.update_layout(title_text='Number of Universities with Score Greater than 60, by Country',xaxis_title="Country",yaxis_title="Number of Universities")
fig.show()

# Bivariate analysis

Bi-variate Analysis finds out the relationship between two variables. Here, we look for association and disassociation between variables at a pre-defined significance level. 

Between continuous and continuous variables:

    1. Scatter plot

Between categorical and continuous variables:
    
    1. Z-test and Student's t-test: these hypothesis tests check whether the mean of a continuous variable differs significantly between two categories.

Scatter plot:

Relationship between numerical values

In [None]:
fig = px.scatter(world, x='score', y='national_rank')
fig.update_layout(title='Score vs Rank',xaxis_title="Score",yaxis_title="Rank")
fig.show()
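
The t-test mentioned above has no example in this notebook, so here is a minimal sketch of one. It assumes SciPy is installed, and the two groups are made-up scores for two hypothetical countries rather than values from `world`. Welch's variant (`equal_var=False`) is used so that equal variances do not have to be assumed.

```python
import numpy as np
from scipy import stats

# Made-up scores for two hypothetical groups (e.g. universities in two countries).
group_a = np.array([70.1, 72.4, 69.8, 71.5, 73.0, 70.9])
group_b = np.array([60.2, 61.8, 59.5, 62.1, 60.7, 61.3])

# Welch's t-test: does the mean score differ between the two groups?
result = stats.ttest_ind(group_a, group_b, equal_var=False)

alpha = 0.05  # pre-defined significance level
significant = bool(result.pvalue < alpha)

print(f"t={result.statistic:.2f}, p={result.pvalue:.4g}, significant={significant}")
```

A p-value below the chosen significance level suggests an association between the categorical variable (the group) and the continuous one (the score).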

# Missing value treatment 

Missing values can be treated in the following ways:

    1. Deletion
    2. Mean/Mode/Median imputation
    3. Prediction model: we train a model, which is allowed to overfit, to predict the missing values. We will see an example a little later.

Deletion:

    Delete a column: As a rule of thumb, if a feature is more than 50% empty and there is no observable correlation between that feature and the target, drop that column. 
    
    Delete a row: If there are garbage values in a particular column, drop the rows that contain them. 
        

In [None]:
ls = covid_line.isnull().sum().to_frame(name = "Null_count").reset_index()
fig = go.Figure(go.Bar(
    x=ls['index'],y=ls['Null_count'],
))
fig.update_layout(title_text='Null counts by bar graph',xaxis_title="Columns",yaxis_title="Number of Null")
fig.show()

Dropping columns as well as rows:

In [None]:
covid_line.drop(["Unnamed: 21" ,"Unnamed: 26", "Unnamed: 22" , "Unnamed: 23" ] , axis = 1 , inplace = True)

One row has a missing reporting date, and the summary column has 5 missing values.


In [None]:
null_in_summary = covid_line[covid_line["summary"].isnull()].index.to_list()
null_in_reporting = covid_line[covid_line["reporting date"].isnull()].index.to_list()

In [None]:
covid_line.drop(null_in_summary, axis = 0, inplace = True )
covid_line.drop(null_in_reporting , axis = 0 , inplace = True)
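
The 50% rule of thumb and the row-wise drop can also be written without hard-coded column names or index lists. The sketch below uses a small made-up frame, not `covid_line`, and covers only the missing-share half of the rule: checking correlation with the target is still left to the analyst.

```python
import numpy as np
import pandas as pd

# Made-up frame: one column is mostly empty, another has a single missing row.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],  # 75% missing
    "summary": ["a", None, "c", "d"],               # 25% missing
})

# Share of missing values per column (mean of a boolean mask).
missing_share = df.isnull().mean()

# Column-wise: drop features that are more than 50% empty.
cols_to_drop = missing_share[missing_share > 0.5].index
df = df.drop(columns=cols_to_drop)

# Row-wise: drop rows where a column we need is missing.
df = df.dropna(subset=["summary"])

print(df)
```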

# Imputation  

In [None]:
# TODO: mean vs. mode imputation (when to use each), ffill and bfill
ls
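
As a rough sketch of the options named in the cell above: mean or median is the usual choice for numeric columns, the mode for categorical ones, and forward/backward fill for ordered data such as time series. The frame and its column names below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in a numeric, a categorical and an ordered column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 35.0],
    "gender": ["male", "female", None, "female"],
    "reading": [1.0, np.nan, np.nan, 4.0],
})

# Numeric column: fill with the mean, or with the median (more robust to outliers).
df["age_mean"] = df["age"].fillna(df["age"].mean())
df["age_median"] = df["age"].fillna(df["age"].median())

# Categorical column: fill with the most frequent value.
df["gender_mode"] = df["gender"].fillna(df["gender"].mode().iloc[0])

# Ordered data: carry the last known value forward, or the next one backward.
df["reading_ffill"] = df["reading"].ffill()
df["reading_bfill"] = df["reading"].bfill()

print(df)
```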

# Outlier detection and removal 

Detection is performed with a box plot, violin plot or scatter plot, or simply by checking which values fall outside the IQR-based range. 

How to treat them? 

1. Deletion 
2. Binning values
3. Imputation

Deletion  

In [None]:
# TODO: when and why to drop outliers
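
One common way to delete outliers is the 1.5 × IQR rule, which is also what a box plot uses to draw its whiskers. This is a minimal sketch on made-up ages; the 1.5 multiplier is a convention, not something fixed by this notebook.

```python
import pandas as pd

# Made-up ages with one implausible value.
df = pd.DataFrame({"age": [22, 25, 27, 29, 31, 33, 35, 120]})

q1 = df["age"].quantile(0.25)
q3 = df["age"].quantile(0.75)
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the rows inside the fences.
cleaned = df[df["age"].between(lower, upper)]

print(f"fences: [{lower}, {upper}]")
print(cleaned)
```

Deletion is easiest to justify when the outlier is clearly a data-entry error; a genuine extreme value may carry information worth keeping.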

Imputation

In [None]:
# TODO: gender and age: forward fill and back fill
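
As a sketch of the two non-deletion options, the snippet below replaces flagged outliers with the median of the remaining values, and separately bins the raw values so that an extreme value simply falls into the top bin. The data, bin edges and labels are made up for illustration.

```python
import pandas as pd

# Made-up ages with one implausible value.
ages = pd.Series([22, 25, 27, 29, 31, 33, 35, 120], name="age")

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = ages.quantile(0.25), ages.quantile(0.75)
iqr = q3 - q1
is_outlier = (ages < q1 - 1.5 * iqr) | (ages > q3 + 1.5 * iqr)

# Imputation: replace outliers with the median of the non-outlier values.
typical_median = ages[~is_outlier].median()
imputed = ages.where(~is_outlier, typical_median)

# Binning: map values to coarse groups (edges and labels are hypothetical).
binned = pd.cut(ages, bins=[0, 30, 60, 150], labels=["young", "middle", "senior"])

print(imputed.tolist())
print(binned.tolist())
```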

# Variable Transformation

When should we use Variable Transformation?

Below are the situations where variable transformation is a requisite:

    When we want to change the scale of a variable or standardize its values so that 
    variables are easier to compare. This kind of transformation is a must 
    when the data is measured on different scales, and it 
    does not change the shape of the variable's distribution.

    When a complex non-linear relationship can be transformed into 
    a linear one. A linear relationship between variables is easier 
    to comprehend than a non-linear or curved one, and a scatter plot 
    can be used to inspect the relationship between two continuous variables. 
    Such transformations often improve prediction as well. 
    The log transformation is one of the most commonly used techniques 
    in these situations.
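
The first situation, rescaling, can be sketched with the `MinMaxScaler` imported at the top of the notebook. The values are made up; the point is that the scaled values keep the same relative spacing, so the shape of the distribution is unchanged.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single feature; scikit-learn expects a 2-D array (rows x features).
values = np.array([[10.0], [20.0], [30.0], [100.0]])

# Min-max scaling maps each value to (x - min) / (max - min), i.e. into [0, 1].
scaled = MinMaxScaler().fit_transform(values)

print(scaled.ravel())
```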


Variable transform methods
    
    1. Logarithm
    2. Square/Cube root
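
A minimal sketch of these methods with NumPy follows; the `prices` series is made up and spans several orders of magnitude to show how each transform compresses the range.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values spanning several orders of magnitude.
prices = pd.Series([1_000, 10_000, 100_000, 1_000_000], name="price")

# Logarithm: strongest compression; defined only for positive values.
log_prices = np.log10(prices)

# log1p computes log(1 + x), which is a safe choice when zeros are present.
log1p_prices = np.log1p(prices)

# Square root: milder compression; defined for non-negative values.
sqrt_prices = np.sqrt(prices)

# Cube root: also defined for negative values.
cbrt_prices = np.cbrt(prices)

print(pd.DataFrame({"log10": log_prices, "sqrt": sqrt_prices, "cbrt": cbrt_prices}))
```

Unlike min-max scaling, these transforms do change the shape of the distribution, which is why they are used to reduce skew or linearize a relationship.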

# Variable creation 

Need? 
