#### Univariate Analysis
```plaintext
Complete Case Analysis (CCA), also called list-wise deletion of cases, consists of discarding observations where the value of any variable is missing.

CCA literally means analyzing only those observations for which there is information in all the variables in the dataset.

In [102]:
import polars as pl
import plotly.express as px

In [103]:
pl.Config.set_tbl_rows(-1)
pl.Config.set_tbl_cols(None)

polars.config.Config

In [104]:
df = pl.read_csv('Data_set\\data_science_job.csv') 
df.head(2)

enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,training_hours,target
i64,str,f64,str,str,str,str,str,i64,str,str,f64,f64
8949,"""city_103""",0.92,"""Male""","""Has relevent experience""","""no_enrollment""","""Graduate""","""STEM""",20,,,36.0,1.0
29725,"""city_40""",0.776,"""Male""","""No relevent experience""","""no_enrollment""","""Graduate""","""STEM""",15,"""50-99""","""Pvt Ltd""",47.0,0.0


In [105]:
null_percent = df.select([
    (pl.col(c).is_null().sum()/df.height).alias(c) 
    for c in df.columns
])*100

# converting it to a vertical df
null_percent = null_percent.unpivot(
    index = [],                       # take all the columns and flatten it out
    variable_name='column',
    value_name = 'Percentage (%)'
)

null_percent

column,Percentage (%)
str,f64
"""enrollee_id""",0.0
"""city""",0.0
"""city_development_index""",2.500261
"""gender""",23.53064
"""relevent_experience""",0.0
"""enrolled_university""",2.014824
"""education_level""",2.401086
"""major_discipline""",14.683161
"""experience""",0.339284
"""company_size""",30.994885


```plaintext
CCA is usually applied when:

1) The data is missing completely at random (MCAR)
2) The column has less than 5% of its data missing
```

| **Advantages**                                                                                         | **Disadvantages**                                                                                         |
|--------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------|
| Easy to implement as no data manipulation is required                                                  | It can exclude a large fraction of the original dataset (if missing data is abundant).                    |
| Preserves variable distribution (if data is MCAR, distribution in reduced dataset matches original)    | Excluded observations could be informative for analysis (if data is not missing at random).              |
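
The trade-off in the table can be illustrated with a small simulation. The sketch below uses synthetic values only (not the job dataset); the 5% MCAR rate and the rule that values above 80 go missing half of the time are assumptions chosen purely for illustration.

```python
import random
import statistics

random.seed(0)

# Synthetic "training hours": purely illustrative, not the notebook's data.
values = [random.gauss(65, 20) for _ in range(20_000)]

# MCAR: every value has the same 5% chance of being missing.
mcar_kept = [v for v in values if random.random() >= 0.05]

# Not at random: values above 80 are dropped half of the time.
biased_kept = [v for v in values if not (v > 80 and random.random() < 0.5)]

full_mean = statistics.mean(values)
mcar_mean = statistics.mean(mcar_kept)
biased_mean = statistics.mean(biased_kept)

print(f"full: {full_mean:.2f}  MCAR: {mcar_mean:.2f}  biased: {biased_mean:.2f}")
```

Under MCAR the mean of the retained values stays close to the full mean, while the value-dependent mechanism pulls it down. This is why CCA is only considered safe when the missingness is plausibly random.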


In [106]:
cols = [var for var in df.columns if df[var].is_null().mean() < 0.05 and df[var].is_null().mean() > 0]
cols

['city_development_index',
 'enrolled_university',
 'education_level',
 'experience',
 'training_hours']

In [107]:
df[cols].sample(5)

city_development_index,enrolled_university,education_level,experience,training_hours
f64,str,str,i64,f64
0.897,"""no_enrollment""","""Masters""",6,34.0
0.887,"""Part time course""","""High School""",3,12.0
0.698,"""no_enrollment""","""High School""",20,87.0
0.739,"""Full time course""","""High School""",3,65.0
,"""Full time course""","""High School""",9,79.0


In [108]:
# count of rows with at least one null in them.
# pl.all() - selects all the columns currently in scope.
# .is_null() - checks for null in each column, resulting in a df of booleans (True where null).
# pl.any_horizontal() - goes row-wise across the columns.
null_rows_mask = df[cols].select(pl.any_horizontal(pl.all().is_null())).to_series()
null_row_count = df.filter(null_rows_mask).height

print("Count of rows with at least one null value in the row : ", null_row_count)
print('Data retained after dropping rows with null values : ', 100 - (null_row_count / len(df) * 100), '%')

Count of rows with at least one null value in the row :  1976
Data retained after dropping rows with null values :  89.68577095730244 %


In [109]:
new_df = df.select(cols).drop_nulls()
df.shape, new_df.shape

((19158, 13), (17182, 5))

#### After applying CCA, plot a histogram of each variable BEFORE and AFTER. If the distributions are almost the same, we can proceed with CCA.

```plaintext
Plot the PDF of each feature before and after CCA. If the distribution is almost the same, proceed with CCA.

Observation - 
The before and after distributions are almost the same, so we can treat the data as missing completely at random.

In [110]:
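# histnorm='probability density' puts both histograms on the same scale,
# since df and new_df have different row counts
fig = px.histogram(df, x='training_hours', nbins=50, histnorm='probability density',
                   color_discrete_sequence=['navy'], opacity=0.6,
                   labels = {'training_hours' : 'Training Hour'})

fig.add_histogram(x = new_df['training_hours'], nbinsx=50, histnorm='probability density',
                  name = "After CCA", marker_color = 'red', opacity=0.4)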

fig.update_layout(
    title = "Training Hours Distribution", 
    xaxis_title = "Training Hours",
    yaxis_title  = "Density", 
    barmode = 'overlay', 
    bargap = 0.01, 
    legend_title = "Data Type"
)

fig.show()

---

For CATEGORICAL columns, we need to make sure the ratio of each category before applying CCA is almost the same after applying CCA.

In [111]:
total_df = df.height
total_new_df = new_df.height

df_counts = (
    df.group_by('enrolled_university')
    .agg([
    # pl.len() - in polars returns the number of rows in each group when inside a .group_by().agg() operation
        (pl.len()/total_df).alias('original')
    ])
)

new_df_count = (
    new_df.group_by('enrolled_university')
    .agg([
        (pl.len()/total_new_df).alias('cca')
        
    ])
)

# join the 2 tables 
result = (
    df_counts.join(new_df_count, on='enrolled_university', how = 'full')
        .sort('original', descending=True)
)

result = result.select(['enrolled_university', 'original', 'cca'])
result = result.filter(result['enrolled_university'].is_not_null())
result

enrolled_university,original,cca
str,f64,f64
"""no_enrollment""",0.721213,0.735188
"""Full time course""",0.196106,0.200733
"""Part time course""",0.062533,0.064079


The ratios are almost the same. Check the other columns as well; if they all match, we can proceed with CCA.