## Customer Analytics: Preparing Data for Modeling
Apply your knowledge of data types and categorical data to prepare a big dataset for modeling!

#### Project Description
You've been hired by a major online data science training provider to store their data much more efficiently, so they can create a model that predicts if course enrollees are looking for a job. You'll convert data types, create ordered categories, and filter ordered categorical data so the data is ready for modeling.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
ds_jobs = pd.read_csv("customer_train.csv")

# View the dataset
ds_jobs.head()

Unnamed: 0,student_id,city,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,job_change
0,8949,city_103,0.92,Male,Has relevant experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevant experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevant experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevant experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevant experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [2]:
# Create a copy of ds_jobs for transforming
ds_jobs_transformed = ds_jobs.copy()

In [3]:
# ds_jobs_transformed.info()
ds_jobs_transformed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   student_id              19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevant_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  job_change              19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

In [4]:
# Value_counts of all columns using EDA to help identify ordinal, nominal, and two-factor categories
for col in ds_jobs.select_dtypes("object").columns:
    print(ds_jobs_transformed[col].value_counts(), '\n')

city
city_103    4355
city_21     2702
city_16     1533
city_114    1336
city_160     845
            ... 
city_129       3
city_111       3
city_121       3
city_140       1
city_171       1
Name: count, Length: 123, dtype: int64 

gender
Male      13221
Female     1238
Other       191
Name: count, dtype: int64 

relevant_experience
Has relevant experience    13792
No relevant experience      5366
Name: count, dtype: int64 

enrolled_university
no_enrollment       13817
Full time course     3757
Part time course     1198
Name: count, dtype: int64 

education_level
Graduate          11598
Masters            4361
High School        2017
Phd                 414
Primary School      308
Name: count, dtype: int64 

major_discipline
STEM               14492
Humanities           669
Other                381
Business Degree      327
Arts                 253
No Major             223
Name: count, dtype: int64 

experience
>20    3286
5      1430
4      1403
3      1354
6      1216
2      1127
7 

In [5]:
# Value_counts of all columns using EDA to help identify ordinal, nominal, and two-factor categories
for col in ds_jobs.select_dtypes('int64').columns:
    print(ds_jobs_transformed[col].value_counts(),'\n')

student_id
8949     1
10660    1
30726    1
18507    1
31273    1
        ..
11547    1
32067    1
14356    1
18051    1
23834    1
Name: count, Length: 19158, dtype: int64 

training_hours
28     329
12     292
18     291
22     282
50     279
      ... 
266      6
234      5
272      5
286      5
238      4
Name: count, Length: 241, dtype: int64 



In [6]:
# Value_counts of all columns using EDA to help identify ordinal, nominal, and two-factor categories
for col in ds_jobs.select_dtypes('float64').columns:
    print(ds_jobs_transformed[col].value_counts(),'\n')

city_development_index
0.920    5200
0.624    2702
0.910    1533
0.926    1336
0.698     683
         ... 
0.649       4
0.807       4
0.781       3
0.625       3
0.664       1
Name: count, Length: 93, dtype: int64 

job_change
0.0    14381
1.0     4777
Name: count, dtype: int64 



In [7]:
# Creaing a mapping dictionary of columns containing two-factor categories to convert to Booleans
two_factor_cats = {
    'relevant_experience': {'No relevant experience': 0, 'Has relevant experience': 0},
    'job_change': {0.0: 0, 1.0: 1.0}
}

In [8]:
# Creating a dictionary of columns containing ordered categorical data
ordered_cats = {
    'enrolled_university': ['no_enrollment', 'Part time course', 'Full time course'],
    'education_level': ['Primary School', 'High School', 'Graduate', 'Masters', 'Phd'],
    'experience': ['<1'] + list(map(str, range(1, 21))) + ['>20'],
    'company_size': ['<10', '10-49', '50-99', '100-499', '500-999', '1000-4999', '5000-9999', '10000+'],
    'last_new_job': ['never', '1', '2', '3', '4', '>4']
}

In [9]:
# Convert categorical columns to ordered categorical 
for col, cats in ordered_cats.items():
    ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('category').cat.set_categories(cats, ordered=True)

print(ds_jobs_transformed.dtypes) 

student_id                   int64
city                        object
city_development_index     float64
gender                      object
relevant_experience         object
enrolled_university       category
education_level           category
major_discipline            object
experience                category
company_size              category
company_type                object
last_new_job              category
training_hours               int64
job_change                 float64
dtype: object


In [10]:
# Loop through DataFrame columns to efficiently change data types
for col in ds_jobs_transformed:
    
    # Convert two-factor categories to bool
    if col in ['relevant_experience', 'job_change']:
        ds_jobs_transformed[col] = ds_jobs_transformed[col].map(two_factor_cats[col])
    
    # Convert integer columns to int32
    elif col in ['student_id', 'training_hours']:
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('int32')
    
    # Convert float columns to float16
    elif col == 'city_development_index':
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('float16')
    
    # Convert columns containing ordered categorical data to ordered categories using dict
    elif col in ordered_cats.keys():
        category = pd.CategoricalDtype(ordered_cats[col], ordered=True)
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('category')
    
    # Convert remaining columns to standard categories
    else:
        ds_jobs_transformed[col] = ds_jobs_transformed[col].astype('category')

In [11]:
# Filter students with 10 or more years experience at companies with at least 1000 employees
ds_jobs_transformed = ds_jobs_transformed[(ds_jobs_transformed['experience'] >= '10') 
& (ds_jobs_transformed['company_size'] >= '1000-4999')]

In [12]:
ds_jobs_transformed

Unnamed: 0,student_id,city,city_development_index,gender,relevant_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,job_change
9,699,city_103,0.919922,,0,no_enrollment,Graduate,STEM,17,10000+,Pvt Ltd,>4,123,0.0
12,25619,city_61,0.913086,Male,0,no_enrollment,Graduate,STEM,>20,1000-4999,Pvt Ltd,3,23,0.0
31,22293,city_103,0.919922,Male,0,Part time course,Graduate,STEM,19,5000-9999,Pvt Ltd,>4,141,0.0
34,26494,city_16,0.910156,Male,0,no_enrollment,Graduate,Business Degree,12,5000-9999,Pvt Ltd,3,145,0.0
40,2547,city_114,0.925781,Female,0,Full time course,Masters,STEM,16,1000-4999,Public Sector,2,14,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19097,25447,city_103,0.919922,Male,0,no_enrollment,Graduate,STEM,>20,1000-4999,Pvt Ltd,>4,57,0.0
19101,6803,city_16,0.910156,Male,0,no_enrollment,High School,,10,10000+,Pvt Ltd,1,89,0.0
19103,32932,city_10,0.895020,Male,0,Part time course,Masters,Other,>20,1000-4999,Pvt Ltd,>4,18,0.0
19128,3365,city_16,0.910156,,0,no_enrollment,Graduate,Humanities,>20,1000-4999,Pvt Ltd,>4,23,0.0


#### Checking memory usage

In [13]:
ds_jobs.memory_usage()

Index                        128
student_id                153264
city                      153264
city_development_index    153264
gender                    153264
relevant_experience       153264
enrolled_university       153264
education_level           153264
major_discipline          153264
experience                153264
company_size              153264
company_type              153264
last_new_job              153264
training_hours            153264
job_change                153264
dtype: int64

In [14]:
ds_jobs_transformed.memory_usage()

Index                     17608
student_id                 8804
city                       7353
city_development_index     4402
gender                     2333
relevant_experience       17608
enrolled_university        2333
education_level            2413
major_discipline           2421
experience                 2933
company_size               2565
company_type               2421
last_new_job               2421
training_hours             8804
job_change                17608
dtype: int64