# Introduction

A common challenge when building models to derive business insights is the sheer size of the datasets involved. Large datasets can stretch the time a model needs to generate predictions, sometimes to days, so storing the data as efficiently as possible matters. Efficient storage speeds up predictions and avoids having to shrink the dataset, which could compromise the integrity and accuracy of the analysis.

This project is undertaken in collaboration with **Training Data Ltd.**, a prominent online data science training provider. The objective is to clean and optimize one of their largest customer datasets, which will ultimately serve to predict whether students are in search of new job opportunities. This predictive information is crucial for directing students to potential recruiters.

The dataset provided for this proof-of-concept is **customer_train.csv**, a curated subset of the entire customer dataset. It contains anonymized information about students, including their job-seeking status during the training period.

The Head Data Scientist at Training Data Ltd. has outlined specific requirements for creating a more efficient DataFrame, referred to as `ds_jobs_transformed`, from the **customer_train.csv** dataset:

- **Boolean Storage**: Columns with categories limited to two factors must be represented as Booleans (`bool`).
- **Integer Storage**: Columns containing only integers should utilize 32-bit integer storage (`int32`).
- **Float Storage**: Columns with floating-point numbers should be stored as 16-bit floats (`float16`).
- **Nominal Categorical Data**: Columns with nominal categorical data should be designated as the `category` data type.
- **Ordinal Categorical Data**: Columns with ordinal categorical data must be treated as ordered categories without mapping to numerical values, reflecting their natural order.
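
The requirements above can be sketched on a small toy frame. This is a minimal illustration, not the notebook's actual transformation: the rows are invented, the name `converted` is hypothetical, and the ordering in `education_order` is an assumption about how the ordinal levels would be ranked.

```python
import pandas as pd

# Toy rows; values are invented and only mirror column names from customer_train.csv.
toy = pd.DataFrame({
    "relevant_experience": ["Has relevant experience", "No relevant experience", "Has relevant experience"],
    "training_hours": [36, 47, 83],
    "city_development_index": [0.92, 0.776, 0.624],
    "city": ["city_103", "city_40", "city_21"],
    "education_level": ["Graduate", "Masters", "Primary School"],
})

# Assumed ordering for the ordinal column, lowest to highest.
education_order = ["Primary School", "High School", "Graduate", "Masters", "Phd"]

converted = toy.copy()
# Two-factor column -> bool
converted["relevant_experience"] = converted["relevant_experience"] == "Has relevant experience"
# Integer column -> int32
converted["training_hours"] = converted["training_hours"].astype("int32")
# Float column -> float16
converted["city_development_index"] = converted["city_development_index"].astype("float16")
# Nominal column -> unordered category
converted["city"] = converted["city"].astype("category")
# Ordinal column -> ordered category, keeping the labels rather than mapping them to numbers
converted["education_level"] = pd.Categorical(
    converted["education_level"], categories=education_order, ordered=True
)

print(converted.dtypes)
```

Because `education_level` is an ordered category, comparisons such as `converted["education_level"] >= "Graduate"` follow the declared order rather than alphabetical order.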

Furthermore, the DataFrame will be filtered to include only students with **10 or more years of experience** at companies with a minimum of **1,000 employees**. This filtering aligns with the needs of their recruiter base, which focuses on experienced professionals from enterprise-level companies.
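
One way to express this filter, assuming `experience` and `company_size` have been stored as ordered categories, is sketched below. The category labels in `experience_order` and `size_order` are assumptions for illustration and may not match the labels in the real file exactly.

```python
import pandas as pd

# Assumed category labels, ordered from lowest to highest.
experience_order = ["<1"] + [str(n) for n in range(1, 21)] + [">20"]
size_order = ["<10", "10-49", "50-99", "100-499", "500-999",
              "1000-4999", "5000-9999", "10000+"]

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "experience": ["<1", "9", "10", ">20", "15"],
    "company_size": ["50-99", "10000+", "1000-4999", "5000-9999", "500-999"],
})
toy["experience"] = pd.Categorical(toy["experience"], categories=experience_order, ordered=True)
toy["company_size"] = pd.Categorical(toy["company_size"], categories=size_order, ordered=True)

# Ordered categories support >= against a label, so no numeric mapping is needed.
mask = (toy["experience"] >= "10") & (toy["company_size"] >= "1000-4999")
filtered = toy[mask]
print(filtered)
```

Only the rows that satisfy both conditions survive; a student with 15 years of experience at a 500-999 company is excluded.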

Upon completing the preprocessing steps, calling `.info()` or `.memory_usage()` on `ds_jobs` and `ds_jobs_transformed` (named `df` and `df_transformed` in the cells that follow) should reveal a significant reduction in memory usage, demonstrating the effectiveness of the applied strategies.
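
A small helper can make the before/after comparison explicit. The function name `memory_report` and the toy frames are hypothetical; this is a sketch of one way to compare two DataFrames, not a step from the original notebook.

```python
import pandas as pd

def memory_report(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Compare total memory of two DataFrames, in bytes."""
    before_bytes = int(before.memory_usage(deep=True).sum())
    after_bytes = int(after.memory_usage(deep=True).sum())
    return {
        "before_bytes": before_bytes,
        "after_bytes": after_bytes,
        "saved_fraction": 1 - after_bytes / before_bytes,
    }

# Toy data: an int64 column and a float64 column holding only 0.0 and 1.0.
before = pd.DataFrame({
    "training_hours": range(1000),
    "job_change": [0.0, 1.0] * 500,
})
after = before.astype({"training_hours": "int32", "job_change": "bool"})

report = memory_report(before, after)
print(report)
```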

Let's proceed with the data cleaning and transformation process!


In [2]:
import pandas as pd

In [4]:
df = pd.read_csv('Dataset/customer_train.csv')

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   student_id              19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevant_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  job_change              19158 non-null  float64
dtypes: float64(2), int64(2), object(10)

In [12]:
df.memory_usage().sum()

2145828

We need to optimize the memory usage of the dataset, starting with a baseline of 2,145,828 bytes.
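
Note that `memory_usage()` defaults to `deep=False`, which counts only the 8-byte pointers for `object` columns and not the strings they point to. The sketch below, on an invented column, shows how the two measurements can differ; the variable names are hypothetical.

```python
import pandas as pd

# An object column of repeated strings, similar in spirit to the `city` column.
cities = pd.DataFrame({"city": pd.Series(["city_103", "city_40"] * 500, dtype=object)})

shallow = int(cities.memory_usage().sum())          # pointers only
deep = int(cities.memory_usage(deep=True).sum())    # pointers plus string contents

# Converting to category stores each distinct label once plus small integer codes.
as_category = cities.astype({"city": "category"})
deep_category = int(as_category.memory_usage(deep=True).sum())

print(shallow, deep, deep_category)
```

With `deep=True`, the savings from converting string columns to `category` become visible, whereas the shallow figure can understate them.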

In [6]:
# Create a copy of df (the ds_jobs data) for transforming
df_transformed = df.copy()

In [7]:
df_transformed.head()

| | student_id | city | city_development_index | gender | relevant_experience | enrolled_university | education_level | major_discipline | experience | company_size | company_type | last_new_job | training_hours | job_change |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8949 | city_103 | 0.92 | Male | Has relevant experience | no_enrollment | Graduate | STEM | >20 | NaN | NaN | 1 | 36 | 1.0 |
| 1 | 29725 | city_40 | 0.776 | Male | No relevant experience | no_enrollment | Graduate | STEM | 15 | 50-99 | Pvt Ltd | >4 | 47 | 0.0 |
| 2 | 11561 | city_21 | 0.624 | NaN | No relevant experience | Full time course | Graduate | STEM | 5 | NaN | NaN | never | 83 | 0.0 |
| 3 | 33241 | city_115 | 0.789 | NaN | No relevant experience | NaN | Graduate | Business Degree | <1 | NaN | Pvt Ltd | never | 52 | 1.0 |
| 4 | 666 | city_162 | 0.767 | Male | Has relevant experience | no_enrollment | Masters | STEM | >20 | 50-99 | Funded Startup | 4 | 8 | 0.0 |


In [8]:
df_transformed.shape

(19158, 14)

In [10]:
df_transformed.dtypes


student_id                  int64
city                       object
city_development_index    float64
gender                     object
relevant_experience        object
enrolled_university        object
education_level            object
major_discipline           object
experience                 object
company_size               object
company_type               object
last_new_job               object
training_hours              int64
job_change                float64
dtype: object

We begin by converting the `job_change` column to a Boolean type, where 1 maps to `True` (the student is looking for a new job) and 0 maps to `False` (the student is not seeking new job opportunities).


In [19]:
# Vectorized comparison: 1.0 becomes True, any other value becomes False
df_transformed['job_change'] = df_transformed['job_change'] == 1

In [20]:
df_transformed['job_change']

0         True
1        False
2        False
3         True
4        False
         ...  
19153     True
19154     True
19155    False
19156    False
19157    False
Name: job_change, Length: 19158, dtype: bool

In [22]:
df_transformed['job_change'].value_counts()

job_change
False    14381
True      4777
Name: count, dtype: int64

Next, we clean the `experience` column by stripping the `<` and `>` prefixes so that only numeric strings remain, then store the result as 32-bit integers (`int32`). Note that this collapses `<1` into 1 and `>20` into 20.
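
The cleaning steps can be sketched on a toy Series as follows. This is an illustrative variant that uses the vectorized `.str` accessor rather than `apply`; the sample values are invented and the variable names are hypothetical.

```python
import pandas as pd

# Toy values, including one missing entry.
raw = pd.Series(["<1", "15", ">20", None, "5"], dtype=object)

# Remove the '<' and '>' markers; missing entries stay missing.
cleaned = raw.str.replace(r"[<>]", "", regex=True).str.strip()

# Parse to numbers; anything unparseable becomes NaN.
numeric = pd.to_numeric(cleaned, errors="coerce")

# Fill missing values with 0 (an assumption that mirrors the notebook) and cast to int32.
as_int32 = numeric.fillna(0).astype("int32")

print(as_int32.tolist())  # [1, 15, 20, 0, 5]
```

Filling missing experience with 0 is a modelling choice: it makes those rows indistinguishable from students with no experience, so it is worth confirming that this is acceptable before relying on it.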

In [24]:
df_transformed['experience'].unique  # no parentheses, so the bound method is shown; call .unique() to list distinct values

<bound method Series.unique of 0        >20
1         15
2          5
3         <1
4        >20
        ... 
19153     14
19154     14
19155    >20
19156     <1
19157      2
Name: experience, Length: 19158, dtype: object>

In [25]:
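# Strip '<' and '>' so '<1' becomes '1' and '>20' becomes '20'; str(x) also turns missing values into the string 'nan'
df_transformed['experience'] = df['experience'].apply(lambda x: str(x).replace('<', '').replace('>', '').strip())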

In [26]:
df_transformed['experience'].unique  # no parentheses, so the bound method is shown; call .unique() to list distinct values

<bound method Series.unique of 0        20
1        15
2         5
3         1
4        20
         ..
19153    14
19154    14
19155    20
19156     1
19157     2
Name: experience, Length: 19158, dtype: object>

In [32]:
df_transformed['experience'].dtypes

dtype('O')

In [33]:
# Fails: the built-in int() converts a single value, not a whole Series
df_transformed['experience'] = int(df_transformed['experience'])

TypeError: cannot convert the series to <class 'int'>
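
The error arises because the built-in `int()` expects a single value. A minimal sketch of the difference, on invented data with hypothetical variable names, is shown below; `Series.astype` and `pd.to_numeric` operate element-wise instead.

```python
import pandas as pd

digits = pd.Series(["20", "15", "5"], dtype=object)

# int() on a multi-element Series raises TypeError.
int_failed = False
try:
    int(digits)
except TypeError as exc:
    int_failed = True
    print(f"int() failed: {exc}")

# Element-wise conversion works when every string is a valid integer.
converted = digits.astype("int32")

# pd.to_numeric is more forgiving: unparseable entries become NaN with errors="coerce".
with_bad_value = pd.to_numeric(pd.Series(["20", "n/a"], dtype=object), errors="coerce")

print(converted.dtype, with_bad_value.tolist())
```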

In [34]:
# Reports 0 because str(x) already turned missing values into the string 'nan'
df_transformed['experience'].isna().sum()

0

In [36]:
df_transformed['experience'] = pd.to_numeric(df_transformed['experience'], errors='coerce')


In [37]:
# Missing experience (NaN after to_numeric) is filled with 0 before casting to int32
df_transformed['experience'] = df_transformed['experience'].fillna(0).astype('int32')

In [39]:
df_transformed['experience'].dtype

dtype('int32')

In [40]:
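df.memory_usage().sum()  # measures the original df, so the total is unchanged; use df_transformed to see the effect of the conversions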

2145828

In [41]:
df_transformed["training_hours"].value_counts()

training_hours
28     329
12     292
18     291
22     282
50     279
      ... 
266      6
234      5
272      5
286      5
238      4
Name: count, Length: 241, dtype: int64

In [43]:
df_transformed["training_hours"].isna().sum()

0

In [44]:
# No missing values (checked above), so cast directly
df_transformed["training_hours"] = df_transformed["training_hours"].astype('int32')

In [45]:
df_transformed['training_hours'].dtype

dtype('int32')
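
Before downcasting an integer column, it can be worth confirming that its values fit in the target type, since `astype` does not guarantee a warning on overflow. The helper name `fits_in_int32` and the sample values below are hypothetical; this is a sketch of one possible check.

```python
import numpy as np
import pandas as pd

def fits_in_int32(values: pd.Series) -> bool:
    """Return True when every value lies within the int32 range."""
    bounds = np.iinfo(np.int32)
    return bool(values.min() >= bounds.min and values.max() <= bounds.max)

hours = pd.Series([8, 36, 336], dtype="int64")

# Downcast only when it is safe to do so.
if fits_in_int32(hours):
    hours = hours.astype("int32")

too_large = pd.Series([2**40], dtype="int64")
print(hours.dtype, fits_in_int32(too_large))
```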