# EDA phase

## CSV file structure

The CSV files downloaded from the IMSS website follow the same structure for every month of each year. Therefore, a single file is used in this exploration phase to understand how the data is stored.
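The claim that every monthly file shares the same structure can be checked cheaply by reading only the header row of each file. The snippet below is a minimal sketch, not part of the original notebook: the helper names `read_header` and `files_with_different_header` are hypothetical, and the two demo files are made-up stand-ins that only reuse the separator and encoding of the real files.

```python
import os
import tempfile

import pandas as pd


def read_header(path, sep='|', encoding='latin-1'):
    """Return the column names of a delimited file without loading its rows."""
    return list(pd.read_csv(path, sep=sep, encoding=encoding, nrows=0).columns)


def files_with_different_header(folder):
    """Compare every CSV header in `folder` against the first file's header."""
    csv_files = sorted(f for f in os.listdir(folder) if f.endswith('.csv'))
    reference = read_header(os.path.join(folder, csv_files[0]))
    return [f for f in csv_files[1:]
            if read_header(os.path.join(folder, f)) != reference]


# Toy demonstration with two small pipe-delimited files (made-up contents)
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ('asg-2021-01-31.csv', 'asg-2021-02-28.csv'):
        with open(os.path.join(demo_folder, name), 'w', encoding='latin-1') as f:
            f.write('cve_delegacion|sexo|ta_sal\n1|1|0\n')
    mismatched_files = files_with_different_header(demo_folder)

# An empty list means every file matches the first file's header
print(mismatched_files)
```

Pointing `files_with_different_header` at the real download folder would list any file whose columns differ from the first one, assuming all files use the same separator and encoding.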

In [1]:
import pandas as pd
import os
import json

In [79]:
# Get the IMSS files dtypes json and import it as a dictionary
with open(os.path.join(os.getcwd(),'IMSS_files_dtypes.json')) as dtypes_json:
    IMSS_files_dtypes = json.load(dtypes_json)

# Use getcwd to get the current folder location, then change the final folder in path to access the Scraping folder
IMSS_files_location = os.path.join(os.getcwd().replace('EDA','Scraping'),'IMSS_Files')

# Get list of files in target location
IMSS_files = os.listdir(IMSS_files_location)

# Get a file from the list
IMSS_first_file = os.path.join(IMSS_files_location, IMSS_files[0])
print(IMSS_first_file)

# Create dataframe from the first IMSS file
IMSS_df = pd.read_csv(IMSS_first_file, sep='|', encoding='latin-1',dtype=IMSS_files_dtypes)

C:\Users\J-D-S\Documents\Projects\IMSS-Salary-Analysis\Scraping\IMSS_Files\asg-2021-01-31.csv


# NaN value exploration
Below, the dataframe is printed to show its total number of entries and features. Some features contain NaN values, which must be explored to determine their effect on the dataframe.

In [80]:
# Print the dataframe
IMSS_df

Unnamed: 0,cve_delegacion,cve_subdelegacion,cve_entidad,cve_municipio,sector_economico_1,sector_economico_2,sector_economico_4,tamaño_patron,sexo,rango_edad,...,ta_sal,teu_sal,tec_sal,tpu_sal,tpc_sal,masa_sal_ta,masa_sal_teu,masa_sal_tec,masa_sal_tpu,masa_sal_tpc
0,1,1,1,A01,,,,,1,E1,...,0,0,0,0,0,0.00,0.00,0.0,0.00,0.0
1,1,1,1,A01,,,,,1,E10,...,0,0,0,0,0,0.00,0.00,0.0,0.00,0.0
2,1,1,1,A01,,,,,1,E11,...,0,0,0,0,0,0.00,0.00,0.0,0.00,0.0
3,1,1,1,A01,,,,,1,E12,...,0,0,0,0,0,0.00,0.00,0.0,0.00,0.0
4,1,1,1,A01,,,,,1,E13,...,0,0,0,0,0,0.00,0.00,0.0,0.00,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4557040,40,58,9,,9,99,9900,S3,2,E6,...,1,1,0,0,0,679.22,679.22,0.0,0.00,0.0
4557041,40,58,9,,9,99,9900,S3,2,E7,...,1,0,0,1,0,1416.55,0.00,0.0,1416.55,0.0
4557042,40,58,9,,9,99,9900,S3,2,E8,...,1,0,0,1,0,2172.00,0.00,0.0,2172.00,0.0
4557043,40,58,9,,9,99,9900,S3,2,E9,...,1,0,0,1,0,2172.00,0.00,0.0,2172.00,0.0


In [81]:
# Get the number of empty entries for each of the dataframe's columns
for column in IMSS_df.columns:
    empty_entries_in_column = IMSS_df[column].isna().sum()
    if empty_entries_in_column > 0:
        print(f'Number of empty rows in the {column} column: {empty_entries_in_column}')

Number of empty rows in the cve_municipio column: 508966
Number of empty rows in the sector_economico_1 column: 16477
Number of empty rows in the sector_economico_2 column: 16477
Number of empty rows in the sector_economico_4 column: 16477
Number of empty rows in the tamaño_patron column: 26663
Number of empty rows in the rango_salarial column: 26614
Number of empty rows in the rango_uma column: 26614
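The same per-column count can also be obtained without an explicit loop. The following is a minimal sketch on a made-up toy dataframe (`toy_df` stands in for `IMSS_df`, and its values are invented); it also adds the share of affected rows, which the loop above does not compute.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for IMSS_df (values are made up)
toy_df = pd.DataFrame({
    'cve_delegacion': [1, 1, 40, 40],
    'cve_municipio': ['A01', 'A01', np.nan, np.nan],
    'tamaño_patron': [np.nan, 'S1', 'S3', 'S3'],
})

# Count NaN entries per column in one vectorized call
nan_counts = toy_df.isna().sum()

# Keep only columns with at least one NaN, largest count first
nan_counts = nan_counts[nan_counts > 0].sort_values(ascending=False)

# Add the share of rows affected to make the counts easier to compare
nan_summary = pd.DataFrame({
    'nan_count': nan_counts,
    'nan_share': nan_counts / len(toy_df),
})
print(nan_summary)
```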


## Empty entries
The previous loop counts the empty entries in each feature and prints only the features that have at least one. These features are:
* cve_municipio
* sector_economico_1
* sector_economico_2
* sector_economico_4
* tamaño_patron
* rango_salarial
* rango_uma

cve_municipio is the code assigned to each municipality within its respective state. A separate dataframe will be created to analyze its missing values.
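One way to start that analysis is to see how the missing cve_municipio values are distributed across delegations. This is a minimal sketch on a made-up toy dataframe; the values, including the code `B02`, are invented for illustration and the result names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for IMSS_df (values are made up)
toy_df = pd.DataFrame({
    'cve_delegacion': [1, 1, 40, 40, 40],
    'cve_municipio': ['A01', 'A01', np.nan, np.nan, 'B02'],
})

# For each delegation, count the rows with a missing cve_municipio
# and compute the share of that delegation's rows they represent
missing_by_delegacion = (
    toy_df['cve_municipio'].isna()
    .groupby(toy_df['cve_delegacion'])
    .agg(['sum', 'mean'])
    .rename(columns={'sum': 'nan_count', 'mean': 'nan_share'})
)
print(missing_by_delegacion)
```

A delegation with a share close to 1 would suggest that the missing codes are tied to how that delegation reports its data rather than being scattered at random.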

sector_economico_1, sector_economico_2 and sector_economico_4 have the same number of missing values (16477), so it's likely that these missing values all occur in the same rows. These 3 features will be grouped in a separate dataframe to find where the missing values come from.
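Equal counts alone do not prove that the gaps fall in the same rows, so the assumption can be tested directly. The sketch below uses a made-up toy dataframe with the three column names from the dataset. If every row that is missing any of the three features is also missing all of them, the two masks are identical.

```python
import numpy as np
import pandas as pd

sector_columns = ['sector_economico_1', 'sector_economico_2', 'sector_economico_4']

# Toy dataframe standing in for IMSS_df (values are made up)
toy_df = pd.DataFrame({
    'sector_economico_1': [np.nan, '9', '9', np.nan],
    'sector_economico_2': [np.nan, '99', '99', np.nan],
    'sector_economico_4': [np.nan, '9900', '9900', np.nan],
})

# Boolean mask of missing values for the three sector features
nan_mask = toy_df[sector_columns].isna()

# Rows missing at least one sector feature vs. rows missing all three
rows_missing_any = nan_mask.any(axis=1)
rows_missing_all = nan_mask.all(axis=1)

# True only when the missing values line up in exactly the same rows
same_rows = rows_missing_any.equals(rows_missing_all)
print(same_rows)
```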

tamaño_patron will be analyzed in its own dataframe since its missing value count is not related to the other features.

rango_salarial and rango_uma are two different calculations of the salary range, because the minimum wage and the UMA have different values. These 2 features have the same number of missing values, so they will be analyzed together in a separate dataframe.

In [157]:
# Create a function that returns the relevant analysis for the empty fields
def nan_value_analysis(source_df, feature):
    # Create a subset of the main dataframe where the target feature is NaN
    nan_feature_df = source_df.loc[source_df[feature].isna()]

    # Get the unique cve_delegacion values for the NaN subset
    unique_cve_delegacion_values = nan_feature_df.cve_delegacion.unique()

    # Get the subset dataframe description for numeric features, formatted to one decimal
    numeric_feature_description = nan_feature_df.describe(exclude='object').apply(lambda col: col.map('{:.1f}'.format))

    # Store results in dict
    nan_value_results_dict = {
        'unique_cve_delegacion_values': unique_cve_delegacion_values,
        f'{feature}_subset_dataframe_description': numeric_feature_description
    }

    return nan_value_results_dict

In [158]:
test = nan_value_analysis(IMSS_df,'cve_municipio')
test.keys()

dict_keys(['unique_cve_delegacion_values', 'cve_municipio_subset_dataframe_description'])
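The same kind of analysis could be run for every feature that contains missing values instead of one feature at a time. The snippet below is a minimal sketch under stated assumptions: `summarize_nan_subset` is a hypothetical, simplified stand-in for the notebook's function, and `toy_df` is a made-up dataframe.

```python
import numpy as np
import pandas as pd


def summarize_nan_subset(source_df, feature):
    """Hypothetical helper: summarize the rows where `feature` is NaN."""
    nan_subset = source_df.loc[source_df[feature].isna()]
    return {
        'row_count': len(nan_subset),
        'unique_cve_delegacion_values': nan_subset['cve_delegacion'].unique(),
    }


# Toy dataframe standing in for IMSS_df (values are made up)
toy_df = pd.DataFrame({
    'cve_delegacion': [1, 1, 40, 40],
    'cve_municipio': ['A01', 'A01', np.nan, np.nan],
    'tamaño_patron': [np.nan, 'S1', 'S3', 'S3'],
})

# Select only the features that contain at least one NaN
features_with_nan = toy_df.columns[toy_df.isna().any()]

# Run the analysis once per feature and keep the results keyed by feature name
nan_results = {f: summarize_nan_subset(toy_df, f) for f in features_with_nan}

for feature, result in nan_results.items():
    print(feature, result['row_count'], result['unique_cve_delegacion_values'])
```

Keying the results by feature name keeps each subset's summary easy to look up later in the notebook.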