# Motivation

In this assignment, we'll perform basic exploratory data analysis (EDA) on three different datasets to gain a wider perspective. We'll build on this work later with feature engineering and feature selection on the same datasets, and your experience with EDA should give you ideas about how to do both well.

Here are the three datasets and the basic facts about them:

* Predicting Conversion (Conversion Rate)
    - Binary Classification
    - 5 features (2 categorical)
* Employee Retention
    - Regression
    - 6 features (2 categorical)
    - two timestamps
* Identifying Fraudulent Activities
    - Binary Classification
    - 10 features (8 categorical)
    

# Task 1: Exploratory Data Analysis for Conversion Rate Dataset

## a. Function that returns a list of the names of categorical variables

### Input:

Type: dataframe

### Output:

Type: list

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: ["sex", "Pclass", ...]

In [1]:
filepath = "conversion_data.csv"

In [2]:
import pandas as pd

data = pd.read_csv(filepath)
df = pd.DataFrame(data)
df.head()

Unnamed: 0,country,age,new_user,source,total_pages_visited,converted
0,UK,25,1,Ads,1,0
1,US,23,1,Seo,5,0
2,US,28,1,Seo,4,0
3,China,39,1,Seo,5,0
4,US,30,1,Seo,6,0


In [3]:
# select_dtypes is the public way to pick out the numeric columns
df.select_dtypes(include="number").columns

Index([u'age', u'new_user', u'total_pages_visited', u'converted'], dtype='object')

In [None]:
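# A column is treated as categorical if it is non-numeric,
# or if it is numeric but takes at most 4 distinct values.
def categorical_variable(dataframe):
    numeric_columns = set(dataframe.select_dtypes(include="number").columns)
    categorical = []
    for column in dataframe.columns:
        # nunique() ignores missing values, so NaNs do not inflate the count
        if column not in numeric_columns or dataframe[column].nunique() <= 4:
            categorical.append(column)
    return categorical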

In [None]:
categorical_variable(df)

## b. Function that returns the list of the names of numeric variables

### Input:

Type: dataframe

### Output:

Type: list

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: ["fare", "age", ...]

In [None]:
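def numeric_variable(dataframe):
    # Compute the categorical columns once instead of once per column
    categorical = set(categorical_variable(dataframe))
    return [column for column in dataframe.columns if column not in categorical]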

In [None]:
numeric_variable(df)

## c. Function that returns, for numeric variables, the mean, median, and 25th, 50th, and 75th percentiles

### Input:

Type: dataframe

### Output:

Type: dataframe with 

columns:

* variable name
* mean
* median
* 25th percentile
* 50th percentile
* 75th percentile

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: dataframe

In [None]:
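def stats(dataframe):
    # Restrict to the columns classified as numeric
    numeric_df = dataframe[numeric_variable(dataframe)]
    # quantile() returns one row per quantile; transpose to one row per variable
    stats_df = numeric_df.quantile([0.25, 0.5, 0.75]).T
    stats_df["mean"] = numeric_df.mean()
    stats_df["median"] = stats_df[0.5]  # the median is the 50th percentile
    stats_df = stats_df.rename_axis("variable name").reset_index()
    stats_df = stats_df.rename(columns={0.25: "25th percentile", 0.5: "50th percentile", 0.75: "75th percentile"})
    # Order the columns as listed in the output specification
    return stats_df[["variable name", "mean", "median", "25th percentile", "50th percentile", "75th percentile"]]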

In [None]:
stats(df)
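As a cross-check on a hand-written summary function, pandas' built-in `describe()` reports the same mean and percentiles. The sketch below uses a small made-up dataframe (the `toy` name and its values are illustrative, not taken from the assignment data).

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the cross-check
toy = pd.DataFrame({"age": [20, 30, 40, 50], "pages": [1, 2, 3, 10]})

# describe() returns one row per statistic; transpose to one row per variable
summary = toy.describe(percentiles=[0.25, 0.5, 0.75]).T

# Keep only the statistics asked for in this task ("50%" is the median)
summary = summary[["mean", "25%", "50%", "75%"]]
print(summary)
```

If the two summaries disagree for a column, the first thing to check is whether that column was classified as categorical and therefore left out.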

## d. For categorical variables, get modes

### Input:

Type: dataframe

### Output:

Type: dict

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: {"Pclass": 28, "Sex": 1, ...}

In [None]:
def mode(dataframe):
    modes = {}
    for column in categorical_variable(dataframe):
        # value_counts() ignores missing values; idxmax() gives the most frequent value
        modes[column] = dataframe[column].value_counts().idxmax()
    return modes

In [None]:
mode(df)

## e. For each column, list the count of missing values

### Input:

Type: dataframe

### Output:

Type: dataframe with 

columns

* var_name
* missing_value_count

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: dataframe

In [None]:
def missing_value_count(dataframe):
    counts_df = dataframe.isnull().sum()
    counts_df = counts_df.reset_index()
    counts_df = counts_df.rename(columns={"index": "var_name", 0: "missing_value_count"})
    return counts_df

In [None]:
missing_value_count(df)

## f. Plot histograms using different subplots of all the numerical values in a single plot

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns


df[numeric_variable(df)].plot(subplots=True,kind = 'hist')
plt.show()

### f.a. Add column names as plot names 

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
df[numeric_variable(df)].plot(subplots=True,kind = 'hist',title = numeric_variable(df) )
plt.show()

### f.b. Change the histogram colour to yellow

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
df[numeric_variable(df)].plot(subplots=True,kind = 'hist', title = numeric_variable(df), color = "y")
plt.show()

### f.c. Fit a normal curve on those histograms

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot
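No solution cell is given for this step, so here is one possible sketch rather than a reference answer. It assumes "fitting" means using each column's sample mean and standard deviation as the parameters of the normal curve, and it draws the histograms as densities so that bars and curve share a scale. The function name `plot_hist_with_normal`, the `bins` parameter, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def plot_hist_with_normal(dataframe, columns, bins=20):
    """Draw one density histogram per column with a normal curve on top."""
    fig, axes = plt.subplots(len(columns), 1, figsize=(6, 3 * len(columns)), squeeze=False)
    for ax, column in zip(axes[:, 0], columns):
        values = dataframe[column].dropna()
        # Assumption: the "fit" is just the sample mean and standard deviation
        mu, sigma = values.mean(), values.std()
        # density=True rescales the bars to total area 1, matching the pdf
        ax.hist(values, bins=bins, density=True, color="y")
        x = np.linspace(values.min(), values.max(), 200)
        pdf = np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
        ax.plot(x, pdf, color="k")
        ax.set_title(column)
    fig.tight_layout()
    return fig, axes


# Toy data standing in for the numeric columns of a real dataset
toy = pd.DataFrame({
    "age": [22, 25, 27, 30, 31, 35, 40, 45],
    "total_pages_visited": [1, 2, 4, 5, 5, 6, 8, 12],
})
fig, axes = plot_hist_with_normal(toy, ["age", "total_pages_visited"])
```

In a notebook, follow the call with `plt.show()`. A column with zero variance would make `sigma` zero, so such columns should be filtered out before plotting.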

## g. Plot facet box plots to check out the distribution according to the target variable

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["sex", "Pclass", ...]
* Expected output: matplotlib plot

In [None]:
df[numeric_variable(df)].plot(subplots=True,kind = 'box')
plt.show()
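The cell above draws one box per column but does not split the data by the target. A minimal sketch of a target-wise version follows, assuming the target is a column such as `converted`. The function name `plot_box_by_target` and the toy values are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_box_by_target(dataframe, columns, target):
    """Draw one panel per column, with one box per value of the target."""
    fig, axes = plt.subplots(1, len(columns), figsize=(4 * len(columns), 4), squeeze=False)
    for ax, column in zip(axes[0], columns):
        # by=target groups the rows so each target value gets its own box
        dataframe.boxplot(column=column, by=target, ax=ax)
        ax.set_title(column)
        ax.set_xlabel(target)
    # pandas adds a figure-level "Boxplot grouped by ..." title; clear it
    fig.suptitle("")
    fig.tight_layout()
    return fig, axes


# Toy data standing in for the conversion dataset
toy = pd.DataFrame({
    "age": [22, 25, 41, 30, 19, 35],
    "total_pages_visited": [1, 3, 2, 14, 18, 12],
    "converted": [0, 0, 0, 1, 1, 1],
})
fig, axes = plot_box_by_target(toy, ["age", "total_pages_visited"], "converted")
```

In a notebook, follow the call with `plt.show()`.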

# Task 2: EDA for Employee Retention Dataset


## a. Function that returns a list of the names of categorical variables

### Input:

Type: dataframe

### Output:

Type: list

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: ["sex", "Pclass", ...]

In [None]:
filepath_1 = "/home/satyabrat/Downloads/employee_retention_data.csv"

In [None]:
data1 = pd.read_csv(filepath_1)
df1 = pd.DataFrame(data1)
df1.head()

In [None]:
categorical_variable(df1)

## b. Function that returns the list of the names of numeric variables

### Input:

Type: dataframe

### Output:

Type: list

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: ["fare", "age", ...]

In [None]:
numeric_variable(df1)

## c. Function that returns, for numeric variables, the mean, median, and 25th, 50th, and 75th percentiles

### Input:

Type: dataframe

### Output:

Type: dataframe with 

columns:

* variable name
* mean
* median
* 25th percentile
* 50th percentile
* 75th percentile

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: dataframe

In [None]:
stats(df1)

## d. For categorical variables, get modes

### Input:

Type: dataframe

### Output:

Type: dict

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: {"Pclass": 28, "Sex": 1, ...}

In [None]:
mode(df1)

## e. For each column, list the count of missing values

### Input:

Type: dataframe

### Output:

Type: dataframe with 

columns

* var_name
* missing_value_count

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: dataframe

In [None]:
missing_value_count(df1)

## f. Plot histograms using different subplots of all the numerical values in a single plot

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
df1[numeric_variable(df1)].plot(subplots=True,kind = 'hist')
plt.show()

### f.a. Add column names as plot names 

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
df1[numeric_variable(df1)].plot(subplots=True,kind = 'hist', title = numeric_variable(df1))
plt.show()

### f.b. Change the histogram colour to yellow

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
df1[numeric_variable(df1)].plot(subplots=True,kind = 'hist', title = numeric_variable(df1), color = "y" )
plt.show()

### f.c. Fit a normal curve on those histograms

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

# Task 3: EDA for Fraudulent Activities Dataset

In [None]:
filepath1 = "/home/satyabrat/Downloads/Translation_Test/test_table.csv"
filepath2 = "/home/satyabrat/Downloads/Translation_Test/user_table.csv"

In [None]:
import pandas as pd

data_test = pd.read_csv(filepath1)
df_test = pd.DataFrame(data_test)
df_test.head()

In [None]:
data_user = pd.read_csv(filepath2)
df_user = pd.DataFrame(data_user)
df_user.head()

## a. Map each user to his country based on his IP address

### Input:

Type: dataframe, dataframe

### Output:

Type: dataframe

### Constraints:

### Test cases:

* Input value: dataframe1, dataframe2
* Expected output: dataframe

In [None]:
df_map = df_user.merge(df_test, on='user_id')
df_map
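The merge above joins the two tables on `user_id`. If the lookup table instead lists IP ranges per country, an exact-key merge will not work and a range join is needed. The sketch below shows one way to do that with `pd.merge_asof`. Every column name in it (`ip_address`, `lower_bound`, `upper_bound`, `country`) and the function name `map_ip_to_country` are assumptions for illustration, not taken from the files loaded above.

```python
import pandas as pd


def map_ip_to_country(users, ip_ranges):
    """Attach a country to each user whose IP falls inside a known range."""
    # merge_asof requires both frames to be sorted on their join keys
    users_sorted = users.sort_values("ip_address")
    ranges_sorted = ip_ranges.sort_values("lower_bound")
    # For each user, pick the last range whose lower bound is <= the user's IP
    merged = pd.merge_asof(
        users_sorted,
        ranges_sorted,
        left_on="ip_address",
        right_on="lower_bound",
        direction="backward",
    )
    # Discard matches where the IP lies beyond the range's upper bound
    in_range = merged["ip_address"] <= merged["upper_bound"]
    merged["country"] = merged["country"].where(in_range)
    return merged.drop(columns=["lower_bound", "upper_bound"])


# Hypothetical toy tables with integer-encoded IP addresses
users = pd.DataFrame({"user_id": [1, 2, 3], "ip_address": [15, 120, 500]})
ip_ranges = pd.DataFrame({
    "lower_bound": [10, 100],
    "upper_bound": [50, 200],
    "country": ["A", "B"],
})
result = map_ip_to_country(users, ip_ranges)
print(result)
```

Users whose IP falls outside every range end up with a missing country, which the missing-value count in a later step would then surface.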

## a. Function that returns a list of the names of categorical variables

### Input:

Type: dataframe

### Output:

Type: list

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: ["sex", "Pclass", ...]

In [None]:
categorical_variable(df_map)

## b. Function that returns the list of the names of numeric variables

### Input:

Type: dataframe

### Output:

Type: list

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: ["fare", "age", ...]

In [None]:
numeric_variable(df_map)

## c. Function that returns, for numeric variables, the mean, median, and 25th, 50th, and 75th percentiles

### Input:

Type: dataframe

### Output:

Type: dataframe with 

columns:

* variable name
* mean
* median
* 25th percentile
* 50th percentile
* 75th percentile

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: dataframe

In [None]:
stats(df_map)

## d. For categorical variables, get modes

### Input:

Type: dataframe

### Output:

Type: dict

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: {"Pclass": 28, "Sex": 1, ...}

In [None]:
mode(df_map)

## e. For each column, list the count of missing values

### Input:

Type: dataframe

### Output:

Type: dataframe with 

columns

* var_name
* missing_value_count

### Constraints:

### Test cases:

* Input value: Titanic dataframe
* Expected output: dataframe

In [None]:
missing_value_count(df_map)

## f. Plot histograms using different subplots of all the numerical values in a single plot

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
df_map[numeric_variable(df_map)].plot(subplots=True,kind = 'hist')
plt.show()

### f.a. Add column names as plot names 

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
# Titles must come from df_map's own columns, not from the Task 1 dataframe
df_map[numeric_variable(df_map)].plot(subplots=True, kind='hist', title=numeric_variable(df_map))
plt.show()

### f.b. Change the histogram colour to yellow

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

In [None]:
# Titles must come from df_map's own columns, not from the Task 1 dataframe
df_map[numeric_variable(df_map)].plot(subplots=True, kind='hist', title=numeric_variable(df_map), color='y')
plt.show()

### f.c. Fit a normal curve on those histograms

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["age", "Fare", ...]
* Expected output: matplotlib plot

## g. Plot facet box plots to check out the distribution according to the target variable

### Input:

Type: dataframe, list_of_columns

### Output:

Type: matplotlib plot

### Constraints:

### Test cases:

* Input value: Titanic dataframe, ["sex", "Pclass", ...]
* Expected output: matplotlib plot

In [None]:
df_map[numeric_variable(df_map)].plot(subplots=True,kind = 'box')
plt.show()