<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Wrangling Lab**


Estimated time needed: **45** minutes


In this lab, you will perform data wrangling tasks to prepare raw data for analysis. Data wrangling involves cleaning, transforming, and organizing data into a structured format suitable for analysis. This lab focuses on tasks like identifying inconsistencies, encoding categorical variables, and feature transformation.


## Objectives


After completing this lab, you will be able to:


- Identify and remove inconsistent data entries.

- Encode categorical variables for analysis.

- Handle missing values using multiple imputation strategies.

- Apply feature scaling and transformation techniques.


#### Install the required libraries


In [265]:
#!pip install pandas
#!pip install matplotlib

## Tasks


#### Step 1: Import the necessary modules.


### 1. Load the Dataset


<h5>1.1 Import necessary libraries and load the dataset.</h5>


Ensure the dataset is loaded correctly by displaying the first few rows.


In [266]:
# Import necessary libraries
import pandas as pd

# Load the Stack Overflow survey data
# dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
# Because my internet connection is slow, I downloaded the dataset once and read
# the local copy instead of fetching it from the URL on every run.

import os

# Path to the local copy of the dataset in the current working directory
dataset_path = os.path.join(os.getcwd(), "survey_data.csv")

df = pd.read_csv(dataset_path)

# Display the first few rows
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

### 2. Explore the Dataset


<h5>2.1 Summarize the dataset by displaying the column data types, counts, and missing values.</h5>


In [267]:
# Write your code here
#Data Types
print(df.dtypes,"\n\n")
print(f"Number of rows: {df.shape[0]}.  Number of columns: {df.shape[1]} \n\n")

#Missing values count
df.isnull().sum()

ResponseId               int64
MainBranch              object
Age                     object
Employment              object
RemoteWork              object
                        ...   
JobSatPoints_11        float64
SurveyLength            object
SurveyEase              object
ConvertedCompYearly    float64
JobSat                 float64
Length: 114, dtype: object 


Number of rows: 65437.  Number of columns: 114 




ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64
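With 114 columns, separate listings of data types and missing counts are hard to read side by side. As a minimal sketch, the snippet below gathers data type, non-null count, missing count, and missing percentage into one summary table. It uses a small made-up `toy` DataFrame (hypothetical values, not the survey data); the same expressions should apply to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey DataFrame
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# One row per column: type, non-null count, missing count, missing percentage
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "non_null": toy.notna().sum(),
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
})

print(summary)
```

Sorting `summary` by `missing_pct` is one way to see quickly which columns are too sparse to impute sensibly.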

<h5>2.2 Generate basic statistics for numerical columns.</h5>


In [268]:
# Write your code here
df.describe()

| | ResponseId | CompTotal | WorkExp | JobSatPoints_1 | JobSatPoints_4 | JobSatPoints_5 | JobSatPoints_6 | JobSatPoints_7 | JobSatPoints_8 | JobSatPoints_9 | JobSatPoints_10 | JobSatPoints_11 | ConvertedCompYearly | JobSat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 65437.0 | 33740.0 | 29658.0 | 29324.0 | 29393.0 | 29411.0 | 29450.0 | 29448.0 | 29456.0 | 29456.0 | 29450.0 | 29445.0 | 23435.0 | 29126.0 |
| mean | 32719.0 | 2.963841e+145 | 11.466957 | 18.581094 | 7.52214 | 10.060857 | 24.343232 | 22.96522 | 20.278165 | 16.169432 | 10.955713 | 9.953948 | 86155.29 | 6.935041 |
| std | 18890.179119 | 5.444117e+147 | 9.168709 | 25.966221 | 18.422661 | 21.833836 | 27.08936 | 27.01774 | 26.10811 | 24.845032 | 22.906263 | 21.775652 | 186757.0 | 2.088259 |
| min | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 16360.0 | 60000.0 | 4.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 32712.0 | 6.0 |
| 50% | 32719.0 | 110000.0 | 9.0 | 10.0 | 0.0 | 0.0 | 20.0 | 15.0 | 10.0 | 5.0 | 0.0 | 0.0 | 65000.0 | 7.0 |
| 75% | 49078.0 | 250000.0 | 16.0 | 22.0 | 5.0 | 10.0 | 30.0 | 30.0 | 25.0 | 20.0 | 10.0 | 10.0 | 107971.5 | 8.0 |
| max | 65437.0 | 1e+150 | 50.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 16256600.0 | 10.0 |


### 3. Identifying and Removing Inconsistencies


<h5>3.1 Identify inconsistent or irrelevant entries in specific columns (e.g., Country).</h5>


In [269]:
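# Write your code here
import numpy as np
# ffill() returns a new Series; without assignment, df stays unchanged
df["Country"].ffill()
# `in` on a Series tests index labels, not values, so this is False
# whether or not Country contains missing values
"NaN" in df["Country"]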

False
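The `in` operator on a pandas Series tests index labels rather than values, so a `False` result here says nothing about missing or inconsistent entries. A value-based check might look like the sketch below. It uses a small made-up `toy` DataFrame (the values are illustrative, not taken from the survey), and the strip-and-title-case normalization is only one possible cleaning rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the survey's Country column
toy = pd.DataFrame({"Country": ["Germany", "germany ", np.nan, "India", "India"]})

# Count missing entries by inspecting the values, not the index
n_missing = int(toy["Country"].isna().sum())

# List every distinct raw entry (including NaN) to spot spelling variants
raw_counts = toy["Country"].value_counts(dropna=False)

# Normalize whitespace and letter case, then compare the number of categories
cleaned = toy["Country"].str.strip().str.title()
n_unique_raw = toy["Country"].nunique()
n_unique_clean = cleaned.nunique()

print(raw_counts)
print(n_missing, n_unique_raw, n_unique_clean)
```

A drop in the number of unique values after normalization indicates that some entries differed only in formatting.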

<h5>3.2 Standardize entries in columns like Country or EdLevel by mapping inconsistent values to a consistent format.</h5>


In [270]:
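## Write your code here
# Note: mode() returns a Series, so fillna aligns on the index and fills at most
# row 0; use .mode()[0] to fill every missing value with the most frequent entry
df["EdLevel"] = df["EdLevel"].fillna(df["EdLevel"].mode())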

mapping = {
    "Bachelor’s degree (B.A., B.S., B.Eng., etc.)": 'Bachelor’s Degree',
    "Master’s degree (M.A., M.S., M.Eng., MBA, etc.)": 'Master’s Degree',
    "Some college/university study without earning a degree": 'Some College',
    "Secondary school (e.g. American high school, German Realschule or Gymnasium, etc.)": 'High School',
    "Professional degree (JD, MD, Ph.D, Ed.D, etc.)": 'PhD',
    "Associate degree (A.A., A.S., etc.)": 'Associate',
    "Something else": 'Other'
}

df["EdLevel"] = df["EdLevel"].replace(mapping)

df["EdLevel"].value_counts()

EdLevel
Bachelor’s Degree            24942
Master’s Degree              15557
Some College                  7651
High School                   5793
PhD                           2970
Associate                     1793
Primary/elementary school     1146
Other                          932
Name: count, dtype: int64
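The counts above add up to 60784, fewer than the 65437 rows in the dataset, which suggests that missing `EdLevel` values remain after the `fillna` call (`value_counts()` skips missing values by default). A likely reason is that `Series.mode()` returns a Series, and `fillna` aligns a Series argument by index instead of broadcasting it. The sketch below illustrates the difference on a made-up Series; the labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy Series with two missing entries
s = pd.Series(["BSc", np.nan, "BSc", np.nan, "MSc"])

# mode() returns a Series; here it holds the single value "BSc" at index 0
filled_by_series = s.fillna(s.mode())

# Taking the first mode as a scalar fills every missing entry
filled_by_scalar = s.fillna(s.mode()[0])

print(filled_by_series.isna().sum())
print(filled_by_scalar.isna().sum())
```

The first result still contains both gaps because rows 1 and 3 have no matching index in the mode Series; the scalar version fills them.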

### 4. Encoding Categorical Variables


<h5>4.1 Encode the Employment column using one-hot encoding.</h5>


In [271]:
## Write your code here
df_Employment_Encoded = pd.get_dummies(data = df, columns = ["Employment"])
df_Employment_Encoded

| | ResponseId | MainBranch | Age | RemoteWork | Check | CodingActivities | EdLevel | LearnCode | LearnCodeOnline | TechDoc | ... | Employment_Student, full-time;Not employed, but looking for work;Not employed, and not looking for work;Student, part-time | Employment_Student, full-time;Not employed, but looking for work;Retired | Employment_Student, full-time;Not employed, but looking for work;Student, part-time | Employment_Student, full-time;Retired | Employment_Student, full-time;Student, part-time | Employment_Student, full-time;Student, part-time;Employed, part-time | Employment_Student, full-time;Student, part-time;Retired | Employment_Student, part-time | Employment_Student, part-time;Employed, part-time | Employment_Student, part-time;Retired |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | I am a developer by profession | Under 18 years old | Remote | Apples | Hobby | Primary/elementary school | Books / Physical media | | | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | 2 | I am a developer by profession | 35-44 years old | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Bachelor’s Degree | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | API document(s) and/or SDK document(s);User gu... | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | 3 | I am a developer by profession | 45-54 years old | Remote | Apples | Hobby;Contribute to open-source projects;Other... | Master’s Degree | Books / Physical media;Colleague;On the job tr... | Technical documentation;Blogs;Books;Written Tu... | API document(s) and/or SDK document(s);User gu... | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | 4 | I am learning to code | 18-24 years old | | Apples | | Some College | Other online resources (e.g., videos, blogs, f... | Stack Overflow;How-to videos;Interactive tutorial | | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | 5 | I am a developer by profession | 18-24 years old | | Apples | | High School | Other online resources (e.g., videos, blogs, f... | Technical documentation;Blogs;Written Tutorial... | API document(s) and/or SDK document(s);User gu... | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 65432 | 65433 | I am a developer by profession | 18-24 years old | Remote | Apples | Hobby;School or academic work | Bachelor’s Degree | On the job training;School (i.e., University, ... | | | ... | False | False | False | False | False | False | False | False | False | False |
| 65433 | 65434 | I am a developer by profession | 25-34 years old | Remote | Apples | Hobby;Contribute to open-source projects | | | | | ... | False | False | False | False | False | False | False | False | False | False |
| 65434 | 65435 | I am a developer by profession | 25-34 years old | In-person | Apples | Hobby | Bachelor’s Degree | Other online resources (e.g., videos, blogs, f... | Technical documentation;Stack Overflow;Social ... | API document(s) and/or SDK document(s);AI-powe... | ... | False | False | False | False | False | False | False | False | False | False |
| 65435 | 65436 | I am a developer by profession | 18-24 years old | Hybrid (some remote, some in-person) | Apples | Hobby;Contribute to open-source projects;Profe... | High School | On the job training;Other online resources (e.... | Technical documentation;Blogs;Written Tutorial... | API document(s) and/or SDK document(s);User gu... | ... | False | False | False | False | False | False | False | False | False | False |
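The column names in this output show that `Employment` stores multi-select answers joined by semicolons, so `pd.get_dummies` creates one indicator column per combination of answers. If one column per individual option is preferred, `Series.str.get_dummies(sep=";")` is a possible alternative. The sketch below applies it to a small made-up `toy` DataFrame (hypothetical rows, not the survey data); the `Employment_` prefix is chosen here only to mirror the naming above.

```python
import pandas as pd

# Hypothetical toy data with semicolon-separated multi-select answers
toy = pd.DataFrame({
    "Employment": [
        "Employed, full-time",
        "Student, full-time;Employed, part-time",
        "Student, full-time",
    ]
})

# Split on ";" and build one 0/1 indicator column per individual option
flags = toy["Employment"].str.get_dummies(sep=";").add_prefix("Employment_")

# Replace the original column with the indicator columns
encoded = pd.concat([toy.drop(columns="Employment"), flags], axis=1)

print(encoded)
```

With this encoding a respondent who selected two options has two indicator columns set, and the number of columns grows with the number of options, not with the number of distinct combinations.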


### 5. Handling Missing Values


<h5>5.1 Identify columns with the highest number of missing values.</h5>


In [272]:
## Write your code here
#Top 5 columns with the most missing values
df.isnull().sum().sort_values(ascending=False)[:5]

AINextMuch less integrated    64289
AINextLess integrated         63082
AINextNo change               52939
AINextMuch more integrated    51999
EmbeddedAdmired               48704
dtype: int64

<h5>5.2 Impute missing values in numerical columns (e.g., `ConvertedCompYearly`) with the mean or median.</h5>


In [273]:
## Write your code here
df["ConvertedCompYearly"] = df["ConvertedCompYearly"].fillna(df["ConvertedCompYearly"].mean())
df["ConvertedCompYearly"]

0        86155.287263
1        86155.287263
2        86155.287263
3        86155.287263
4        86155.287263
             ...     
65432    86155.287263
65433    86155.287263
65434    86155.287263
65435    86155.287263
65436    86155.287263
Name: ConvertedCompYearly, Length: 65437, dtype: float64
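The task allows either the mean or the median. The summary statistics in section 2.2 report a mean of about 86155, a median of 65000, and a maximum of 16256600 for `ConvertedCompYearly`, so the column is right-skewed and the mean is pulled upward by large values. The sketch below compares the two strategies on made-up numbers (hypothetical, not survey values) to show how much a single outlier can shift the mean-based fill.

```python
import numpy as np
import pandas as pd

# Hypothetical compensation values with one extreme outlier and one gap
comp = pd.Series([40000.0, 65000.0, np.nan, 90000.0, 5000000.0])

# Mean imputation: sensitive to the outlier
mean_filled = comp.fillna(comp.mean())

# Median imputation: robust to the outlier
median_filled = comp.fillna(comp.median())

print(mean_filled[2], median_filled[2])
```

For skewed columns the median is often the more representative fill value, though the right choice depends on the later analysis.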

<h5>5.3 Impute missing values in categorical columns (e.g., `RemoteWork`) with the most frequent value.</h5>


In [274]:
## Write your code here
df["RemoteWork"] = df["RemoteWork"].fillna(df["RemoteWork"].mode()[0])
df["RemoteWork"].unique()

array(['Remote', 'Hybrid (some remote, some in-person)', 'In-person'],
      dtype=object)

### 6. Feature Scaling and Transformation


<h5>6.1 Apply Min-Max Scaling to normalize the `ConvertedCompYearly` column.</h5>


In [276]:
## Write your code here
# Missing values were already imputed with the mean in step 5.2, so no further filling is needed here.

# Min-Max scaling: (x - min) / (max - min) maps the column onto [0, 1]
comp_min = df["ConvertedCompYearly"].min()
comp_max = df["ConvertedCompYearly"].max()
df["ConvertedCompYearly_MinMax"] = (df["ConvertedCompYearly"] - comp_min) / (comp_max - comp_min)
df["ConvertedCompYearly_MinMax"]

0        0.0053
1        0.0053
2        0.0053
3        0.0053
4        0.0053
          ...  
65432    0.0053
65433    0.0053
65434    0.0053
65435    0.0053
65436    0.0053
Name: ConvertedCompYearly_MinMax, Length: 65437, dtype: float64
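Min-Max scaling maps the smallest value to 0 and the largest to 1. Because the column's maximum (16256600) is far above typical values, most scaled entries land near 0, which is consistent with the 0.0053 shown for the mean-imputed rows. The sketch below wraps the formula in a reusable helper; `min_max_scale` is a hypothetical name introduced here, and returning zeros for a constant column is an assumption made to avoid division by zero.

```python
import pandas as pd

def min_max_scale(series: pd.Series) -> pd.Series:
    """Scale a numeric Series to the [0, 1] range."""
    lo, hi = series.min(), series.max()
    span = hi - lo
    if span == 0:
        # Assumption: a constant column is mapped to all zeros
        return pd.Series(0.0, index=series.index)
    return (series - lo) / span

# Hypothetical values with one large outlier
scaled = min_max_scale(pd.Series([1.0, 50.0, 100.0, 1000.0]))
constant = min_max_scale(pd.Series([5.0, 5.0]))

print(scaled.tolist())
print(constant.tolist())
```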

<h5>6.2 Log-transform the ConvertedCompYearly column to reduce skewness.</h5>


In [278]:
## Write your code here
df["ConvertedCompYearly_log"] = np.log1p(df["ConvertedCompYearly"])
df["ConvertedCompYearly_log"]

0        11.363918
1        11.363918
2        11.363918
3        11.363918
4        11.363918
           ...    
65432    11.363918
65433    11.363918
65434    11.363918
65435    11.363918
65436    11.363918
Name: ConvertedCompYearly_log, Length: 65437, dtype: float64
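One way to check whether the log transform actually reduces skewness is to compare `Series.skew()` before and after. The sketch below does this on synthetic, right-skewed data drawn from a log-normal distribution; the distribution parameters and the seed are arbitrary assumptions, not properties of the survey. `np.log1p` computes `log(1 + x)`, which stays defined when a value is 0.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data (assumed log-normal; not the survey values)
rng = np.random.default_rng(0)
comp = pd.Series(rng.lognormal(mean=11, sigma=1, size=1000))

# Sample skewness before and after the transform
skew_before = comp.skew()
skew_after = np.log1p(comp).skew()

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness near 0 after the transform suggests a roughly symmetric distribution. Running the same two lines on `df["ConvertedCompYearly"]` would show the effect on the real column.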

### 7. Feature Engineering


<h5>7.1 Create a new column `ExperienceLevel` based on the `YearsCodePro` column:</h5>


In [287]:
## Write your code here
'''Unique values:
array([nan, '17', '27', '7', '11', '25', '12', '10', '3',
       'Less than 1 year', '18', '37', '15', '20', '6', '2', '16', '8',
       '14', '4', '45', '1', '24', '29', '5', '30', '26', '9', '33', '13',
       '35', '23', '22', '31', '19', '21', '28', '34', '32', '40', '50',
       '39', '44', '42', '41', '36', '38', 'More than 50 years', '43',
       '47', '48', '46', '49'], dtype=object)
'''

df["YearsCodePro"] = df["YearsCodePro"].fillna(df["YearsCodePro"].mode()[0])

# First, build the mapping dictionary for the two non-numeric answers
experience_mapping = {
    "Less than 1 year": "Entry Level",
    "More than 50 years": "Expert Level",
}

# Map each numeric answer (stored as a string) to an experience level
for years in range(1, 51):
    key = str(years)
    if years <= 2:
        experience_mapping[key] = "Entry Level"
    elif years <= 5:
        experience_mapping[key] = "Mid Level"
    elif years <= 10:
        experience_mapping[key] = "Senior Level"
    else:
        experience_mapping[key] = "Expert Level"

df["ExperienceLevel"] = df["YearsCodePro"].map(experience_mapping)

df[["YearsCodePro","ExperienceLevel"]]


| | YearsCodePro | ExperienceLevel |
|---|---|---|
| 0 | 2 | Entry Level |
| 1 | 17 | Expert Level |
| 2 | 27 | Expert Level |
| 3 | 2 | Entry Level |
| 4 | 2 | Entry Level |
| ... | ... | ... |
| 65432 | 3 | Mid Level |
| 65433 | 2 | Entry Level |
| 65434 | 5 | Mid Level |
| 65435 | 2 | Entry Level |
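The same binning can also be expressed by converting the answers to numbers and using `pd.cut`. The sketch below is an alternative under stated assumptions, not the lab's required solution: it treats "Less than 1 year" as 0 and "More than 50 years" as 51 (arbitrary placeholders chosen here), uses the same bin edges as the mapping above, and leaves missing answers missing instead of filling them with the mode. The `years_raw` Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical answers in the same format as YearsCodePro
years_raw = pd.Series(
    ["Less than 1 year", "2", "4", "10", "17", "More than 50 years", np.nan]
)

# Assumption: represent the two text answers as 0 and 51 years
years_num = pd.to_numeric(
    years_raw.replace({"Less than 1 year": "0", "More than 50 years": "51"}),
    errors="coerce",
)

# Right-closed bins: (-1, 2], (2, 5], (5, 10], (10, inf)
levels = pd.cut(
    years_num,
    bins=[-1, 2, 5, 10, np.inf],
    labels=["Entry Level", "Mid Level", "Senior Level", "Expert Level"],
)

print(levels.tolist())
```

Keeping a numeric version of the column also makes it usable for statistics such as the median years of experience per level.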


### Summary


In this lab, you:

- Explored the dataset to identify inconsistencies and missing values.

- Encoded categorical variables for analysis.

- Handled missing values using imputation techniques.

- Normalized and transformed numerical data to prepare it for analysis.

- Engineered a new feature to enhance data interpretation.


Copyright © IBM Corporation. All rights reserved.
