# COVID-19 Data Exploration

This notebook covers the preprocessing part of the assignment: data visualization, an assessment of feature engineering options (transformations, dimensionality reduction, interactions between features) and, finally, feature selection.

In [1]:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as f
from pyspark.sql.functions import countDistinct, approx_count_distinct

import pandas as pd
from utils.visualization import cat_plot, plot_counts, plot_hist
from utils.functions import *

import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure

# sns.set_style("darkgrid")
%matplotlib inline


In [None]:
spark = SparkSession.builder \
    .master("local") \
    .appName("COVID-19") \
    .getOrCreate()

In [None]:
df = data_retrieval(spark).sample(False, 0.05)

In [None]:
n = df.count()
m = len(df.columns)
(n , m)

Checking for missing values:

In [None]:
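# Fraction of rows per column that are null, NaN or equal to the -1 sentinel.
# The denominator counts all rows; f.count(c) would skip nulls and inflate the ratio.
df.agg(*[
    (f.count(f.when(f.isnull(c) | f.isnan(c) | (f.col(c) == -1), c)) / f.count(f.lit(1))).alias(c)
    for c in df.columns
]).toPandas().T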

Feature_12 has almost 50% missing values, which are represented by -1, so we'll omit this feature for now.
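As a pandas-side illustration of the same check, the sketch below computes the share of missing values per column when `-1` is used as a sentinel and lists the columns above a cutoff. The toy frame, the helper name `missing_fraction` and the 0.4 cutoff are assumptions made for illustration, not values from the assignment.

```python
import numpy as np
import pandas as pd


def missing_fraction(frame: pd.DataFrame, sentinel=-1) -> pd.Series:
    """Share of rows per column that are NaN/None or equal to the sentinel."""
    is_missing = frame.isna() | frame.eq(sentinel)
    return is_missing.mean()


# Hypothetical toy data: feature_b uses -1 for half of its rows.
toy = pd.DataFrame({
    "feature_a": [1, 2, 3, 4],
    "feature_b": [-1, 5, -1, 7],
    "feature_c": [0.5, np.nan, 1.5, 2.0],
})

fractions = missing_fraction(toy)
# 0.4 is an illustrative cutoff, not a recommendation.
to_drop = fractions[fractions > 0.4].index.tolist()
print(fractions)
print(to_drop)
```

Dropping is only one option; a column like this could also be kept with the sentinel mapped to its own category.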

Separation of numerical and categorical features:

In [None]:
numerical_cols = ['feature_time', 'feature_2', 'feature_3', 'feature_4',  'feature_15', 'label']
categorical_cols = ['feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'feature_10', 'feature_11', 'feature_13', 'feature_14', 'feature_16', 'feature_17', 'feature_18', 'feature_19', 'feature_20', 'feature_21'] 

In [None]:
from pyspark.sql.window import Window
df.groupby('label').count().withColumn('prc', f.round(f.col('count')/f.sum('count').over(Window.partitionBy()), 4)).show()

In [None]:
df = df.select(*(f.col(c).cast("int").alias(c) for c in numerical_cols), *categorical_cols).toPandas()
numerical_cols.remove('label')

The data is imbalanced, with only 17% positive labels. We can balance it by over- or under-sampling, change or tune the loss function, or keep it as is and look at recall, precision and AUC to make sure that the model's predictions are sufficient.
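To make the loss-function and metric options concrete, here is a minimal sketch. It assumes the common "balanced" weighting heuristic, n_samples / (n_classes * class_count), which is one possible choice and not necessarily the one used in the assignment. The helper names and the toy labels are hypothetical.

```python
import numpy as np


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (classes.size * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return float(precision), float(recall)


# Hypothetical labels with 17% positives, mirroring the ratio noted above.
y = np.array([1] * 17 + [0] * 83)
weights = balanced_class_weights(y)

# A model that always predicts 0 is 83% accurate but never finds a positive.
precision, recall = precision_recall(y, np.zeros_like(y))
print(weights, precision, recall)
```

The always-negative baseline shows why accuracy alone is misleading here and why recall and precision are worth tracking.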

### Categorical features:

There aren't many features in the data set, but some of them have many categories. We can either remove these features or merge their rare categories. I set the threshold to 10 after visualization, since I saw many categories with very small counts.
Another option is to merge categories with a similar label distribution.
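Merging rare categories can be sketched as follows. This is not the notebook's `merge_categories` helper from `utils.functions`, whose implementation is not shown here. The function `merge_rare_categories`, its `min_count` parameter and the toy column are hypothetical stand-ins.

```python
import pandas as pd


def merge_rare_categories(values: pd.Series, min_count: int,
                          other_label: str = "other") -> pd.Series:
    """Replace categories seen fewer than min_count times with other_label."""
    counts = values.value_counts()
    rare = counts[counts < min_count].index
    # Keep values that are not rare; everything else becomes other_label.
    return values.where(~values.isin(rare), other_label)


# Hypothetical toy column: "c" and "d" are rare.
toy = pd.Series(["a"] * 12 + ["b"] * 10 + ["c"] * 3 + ["d"] * 1)
merged = merge_rare_categories(toy, min_count=10)
print(merged.value_counts())
```

Merging by similar label distribution would need a different grouping rule, for example binning categories by their positive rate.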

Categorical features removed based on their visualized distributions:

- feature 7: uniform distribution of the positive label
- feature 8: very skewed distribution; we could transform the feature to include only a single category with a high positive rate
- feature 12: many nulls, which should be transformed into an "other" category
- feature 14: uniform distribution of the positive label

In [None]:
for col in categorical_cols:
    print(col)
    # Merge categories with the project helper, keeping the original labels
    df['col_merge'] = merge_categories(df[col], 50, encode=False)
    grouped = df.groupby('col_merge')['label']
    d = {'value_counts': df['col_merge'].value_counts(),
         'prc': df['col_merge'].value_counts(normalize=True),
         'labeled': grouped.sum(),
         # share of all positive labels that fall into this category
         'prc_labeled_out': grouped.sum() / df['label'].sum(),
         # positive rate within this category
         'prc_labeled_in': grouped.mean()
        }

    # Name the index after the column so reset_index() yields a `col` column
    # regardless of how the pandas version names the index
    col_summary = pd.DataFrame(data=d).rename_axis(col).reset_index().sort_values('prc_labeled_in')
    col_summary = pd.melt(col_summary, id_vars=[col, 'value_counts', 'prc', 'labeled'], value_vars=['prc_labeled_out', 'prc_labeled_in'])
    cat_plot(col_summary, col)
#     cat_plot(col_summary, col, ('prc_labeled_out', 'prc labeled out-sample'))

### Numerical features:

In [None]:
df[numerical_cols].describe()

In [None]:
for col in numerical_cols:
    print(col)
    plot_hist(df[df['label']==0][col], df[df['label']==1][col], col)
#     plot_hist(np.log(df[df['label']==0][col]), np.log(df[df['label']==1][col]), col)
#     plot_hist(np.log1p(df[df['label']==0][col].pct_change()), np.log1p(df[df['label']==1][col].pct_change()), col)
 

I checked the correlation between the numerical features and found that features 2 and 3 are very highly correlated with each other (0.97), so I will remove one of them.

In [None]:
# Restrict to the numerical features and compute the matrix once
corr = df[numerical_cols].corr()
display(corr)
plt.matshow(corr)
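The removal of one feature from a highly correlated pair can be automated along these lines. This is a sketch under assumptions: the 0.95 threshold, the helper `highly_correlated_pairs` and the synthetic columns are illustrative and not taken from the assignment data.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(frame: pd.DataFrame, threshold: float = 0.95):
    """Return (col_a, col_b, corr) for column pairs with |corr| above threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    # Only walk the upper triangle so each pair is reported once.
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) > threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs


# Hypothetical toy data: x2 is almost a multiple of x1, x3 is unrelated noise.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
toy = pd.DataFrame({
    "x1": x1,
    "x2": 2 * x1 + rng.normal(scale=0.1, size=200),
    "x3": rng.normal(size=200),
})

pairs = highly_correlated_pairs(toy)
to_drop = sorted({b for _, b, _ in pairs})  # keep the first column of each pair
reduced = toy.drop(columns=to_drop)
print(pairs)
print(list(reduced.columns))
```

Which member of a pair to keep is a judgment call; keeping the first column is just the simplest rule.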
