<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Data Wrangling Lab**


Estimated time needed: **45** minutes


In this lab, you will perform data wrangling tasks to prepare raw data for analysis. Data wrangling involves cleaning, transforming, and organizing data into a structured format suitable for analysis. This lab focuses on tasks like identifying inconsistencies, encoding categorical variables, and feature transformation.


## Objectives


After completing this lab, you will be able to:


- Identify and remove inconsistent data entries.

- Encode categorical variables for analysis.

- Handle missing values using multiple imputation strategies.

- Apply feature scaling and transformation techniques.


#### Install the required libraries


In [1]:
!pip install pandas
!pip install matplotlib



## Tasks


#### Step 1: Import the necessary module.


### 1. Load the Dataset


<h5>1.1 Import necessary libraries and load the dataset.</h5>


Ensure the dataset is loaded correctly by displaying the first few rows.


In [2]:
# Import necessary libraries
import pandas as pd

# Load the Stack Overflow survey data
dataset_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(dataset_url)

# Display the first few rows
print(df.head())


   ResponseId                      MainBranch                 Age  \
0           1  I am a developer by profession  Under 18 years old   
1           2  I am a developer by profession     35-44 years old   
2           3  I am a developer by profession     45-54 years old   
3           4           I am learning to code     18-24 years old   
4           5  I am a developer by profession     18-24 years old   

            Employment RemoteWork   Check  \
0  Employed, full-time     Remote  Apples   
1  Employed, full-time     Remote  Apples   
2  Employed, full-time     Remote  Apples   
3   Student, full-time        NaN  Apples   
4   Student, full-time        NaN  Apples   

                                    CodingActivities  \
0                                              Hobby   
1  Hobby;Contribute to open-source projects;Other...   
2  Hobby;Contribute to open-source projects;Other...   
3                                                NaN   
4                                 

In [3]:
from IPython.display import HTML, display
def summarize_dataset(df):
    """
    Creates a detailed summary of the dataset with HTML formatting
    """
    summary = {
        'Column': df.columns,
        'Data Type': df.dtypes,
        'Count': df.count(),
        'Missing': df.isnull().sum(),
        'Missing %': (df.isnull().sum() / len(df) * 100).round(2),
        'Unique Values': df.nunique()
    }
    
    summary_df = pd.DataFrame(summary)
    
    html_content = """
    <style>
        .summary-container {
            margin: 20px;
            font-family: Arial, sans-serif;
        }
        .summary-table {
            width: 100%;
            border-collapse: collapse;
            margin: 10px 0;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }
        .summary-table th {
            background-color: #1976D2;
            color: white;
            padding: 12px;
            text-align: left;
        }
        .summary-table td {
            padding: 10px;
            border-bottom: 1px solid #ddd;
        }
        .summary-table tr:nth-child(even) {
            background-color: #f5f5f5;
        }
        .missing-high {
            color: #d32f2f;
            font-weight: bold;
        }
    </style>
    
    <div class='summary-container'>
        <h2>Dataset Summary</h2>
        <table class='summary-table'>
            <tr>
                <th>Column</th>
                <th>Data Type</th>
                <th>Count</th>
                <th>Missing</th>
                <th>Missing %</th>
                <th>Unique Values</th>
            </tr>
    """
    
    for idx, row in summary_df.iterrows():
        missing_class = 'missing-high' if row['Missing %'] > 5 else ''
        html_content += f"""
            <tr>
                <td>{row['Column']}</td>
                <td>{row['Data Type']}</td>
                <td>{row['Count']}</td>
                <td class='{missing_class}'>{row['Missing']}</td>
                <td class='{missing_class}'>{row['Missing %']}%</td>
                <td>{row['Unique Values']}</td>
            </tr>
        """
    
    html_content += "</table></div>"
    display(HTML(html_content))
    
    return summary_df

In [4]:
def analyze_numerical_columns(df):
    """
    Generates comprehensive statistics with enhanced spacing
    """
    numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
    stats = df[numerical_cols].agg([
        'count', 'mean', 'median', 'std', 'min', 
        lambda x: x.quantile(0.25),
        lambda x: x.quantile(0.75),
        'max'
    ]).round(2)
    
    stats.index = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', '25th', '75th', 'Max']
    
    html_content = """
    <style>
        .stats-container {
            overflow-x: auto;
            margin: 20px 0;
        }
        .stats-table {
            min-width: 800px;
            width: 100%;
            border-collapse: separate;
            border-spacing: 0;
            margin: 20px 0;
            font-family: Arial, sans-serif;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }
        .stats-table th, .stats-table td {
            padding: 12px 15px;
            text-align: right;
            border-bottom: 1px solid #ddd;
            white-space: nowrap;
        }
        .stats-table th {
            background-color: #1976D2;
            color: white;
            position: sticky;
            top: 0;
        }
        .stats-table tr:nth-child(even) {
            background-color: #f5f5f5;
        }
        .value {
            font-weight: bold;
            color: #2e7d32;
        }
        .metric-name {
            text-align: left !important;
            font-weight: bold;
        }
    </style>
    
    <div class='stats-container'>
        <h2>Numerical Column Statistics</h2>
        <table class='stats-table'>
    """
    
    # Add headers
    html_content += "<tr><th class='metric-name'>Metric</th>"
    html_content += "".join([f"<th>{col}</th>" for col in numerical_cols]) + "</tr>"
    
    # Add rows
    for idx, row in stats.iterrows():
        html_content += f"<tr><td class='metric-name'>{idx}</td>"
        html_content += "".join([f"<td class='value'>${val:,.2f}</td>" if 'Converted' in col 
                               else f"<td class='value'>{val:,.2f}</td>" 
                               for col, val in row.items()]) + "</tr>"
    
    html_content += "</table></div>"
    display(HTML(html_content))
    
    return stats

In [5]:
def analyze_multiple_columns_consistency(df, columns):
    """
    Analyzes consistency across multiple columns simultaneously
    """
    html_content = """
    <style>
        .multi-consistency-container {
            margin: 20px;
            font-family: Arial, sans-serif;
            display: grid;
            grid-template-columns: repeat(auto-fit, minmax(300px, 1fr));
            gap: 20px;
        }
        .column-card {
            background-color: gray;
            border-radius: 8px;
            padding: 15px;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }
        .value-list {
            max-height: 200px;
            overflow-y: auto;
            padding: 10px;
            background-color: black;
            border-radius: 4px;
            margin-top: 10px;
        }
        .highlight {
            color: #1976D2;
            font-weight: bold;
        }
    </style>
    
    <div class='multi-consistency-container'>
    """
    
    analysis_results = {}
    
    for column in columns:
        value_counts = df[column].value_counts()
        rare_values = value_counts[value_counts < len(df) * 0.01]
        
        html_content += f"""
        <div class='column-card'>
            <h3>{column}</h3>
            <p>Unique Values: <span class='highlight'>{len(value_counts)}</span></p>
            <p>Rare Values (<1%): <span class='highlight'>{len(rare_values)}</span></p>
            
            <h4>Top 5 Most Common:</h4>
            <div class='value-list'>
                {value_counts.head().to_frame().to_html()}
            </div>
            
            <h4>Sample Rare Values:</h4>
            <div class='value-list'>
                {rare_values.head().to_frame().to_html()}
            </div>
        </div>
        """
        
        analysis_results[column] = {
            'value_counts': value_counts,
            'rare_values': rare_values
        }
    
    html_content += "</div>"
    display(HTML(html_content))
    
    return analysis_results

In [6]:
def standardize_column_values(df, column_mappings):
    """
    Standardizes column values using predefined mappings
    Returns standardized DataFrame and mapping statistics
    """
    df_standardized = df.copy()
    stats = {}
    
    html_content = """
    <style>
        .standard-container {
            margin: 20px;
            font-family: Arial, sans-serif;
        }
        .mapping-card {
            background-color: gray;
            border-radius: 8px;
            padding: 15px;
            margin: 10px 0;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }
        .mapping-table {
            width: 100%;
            border-collapse: collapse;
        }
        .mapping-table th {
            background-color: #1976D2;
            color: black;
            padding: 10px;
        }
        .mapping-table td {
            padding: 8px;
            border-bottom: 1px solid #ddd;
        }
    </style>
    """
    
    for column, mapping in column_mappings.items():
        # Apply mapping
        original_values = df_standardized[column].nunique()
        df_standardized[column] = df_standardized[column].map(mapping).fillna(df_standardized[column])
        standardized_values = df_standardized[column].nunique()
        
        stats[column] = {
            'original_unique': original_values,
            'standardized_unique': standardized_values,
            'values_mapped': len(mapping)
        }
        
        html_content += f"""
        <div class='standard-container'>
            <div class='mapping-card'>
                <h3>Standardization Results: {column}</h3>
                <p>Original Unique Values: <b>{original_values}</b></p>
                <p>Standardized Unique Values: <b>{standardized_values}</b></p>
                <p>Values Mapped: <b>{len(mapping)}</b></p>
                
                <h4>Mapping Details:</h4>
                <table class='mapping-table'>
                    <tr>
                        <th>Original Value</th>
                        <th>Standardized Value</th>
                    </tr>
                    {''.join(f"<tr><td>{k}</td><td>{v}</td></tr>" for k, v in mapping.items())}
                </table>
            </div>
        </div>
        """
    
    display(HTML(html_content))
    return df_standardized, stats


In [7]:
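# Build a column-safe name from a category label
import string
def clean_column_name(category):
    # Stripping string.punctuation also removes the underscores inserted for spaces,
    # so "Employed, full-time" becomes "Employedfulltime"
    return category.replace(' ', '_').replace(',', '').translate(str.maketrans('', '', string.punctuation))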


#print(df['Employment'].unique())
df['Emp'] = df['Employment'].str.split(';')
#df['Emp'] = df['Emp'].apply(lambda x: [i.replace(' ', '_').replace(',', '') for i in x])
unique_categories = set([item for sublist in df['Emp'] for item in sublist])
#print(unique_categories)
# Step 3: Create one-hot encoding for each category
for category in unique_categories:
    colname=clean_column_name(category)
    df[colname] = df['Employment'].apply(lambda x: 1 if category in x else 0)
#df['Emp'] = df['Emp'].apply(lambda x: [i.replace(' ', '_').replace(',', '') for i in x])

print(df['Employedfulltime'].value_counts())

'''encoded_df = pd.get_dummies(df['Employment'], prefix='Employment')
#encoded_df
# Melt the DataFrame to have one value per row
df_melted = encoded_df.melt(value_name='Employment_Status').dropna()
#df_melted
# One-hot encode the 'Employment_Status' column
one_hot_encoded_df = pd.get_dummies(df_melted['Employment_Status'])
one_hot_encoded_df
#df_encoded = pd.concat([df, one_hot_encoded_df], axis=1)

#print(df_encoded)
'''

Employedfulltime
1    45162
0    20275
Name: count, dtype: int64

In [8]:
import string
def clean_column_name(category):
    return category.replace(' ', '_').replace(',', '').translate(str.maketrans('', '', string.punctuation))

def create_advanced_encoding(df, column='Employment'):
    """
    Creates one-hot encoding using split and transform logic
    """
    df_encoded = df.copy()
    
    # Split and clean categories
    df_encoded['Emp'] = df_encoded[column].str.split(';')
    #df_encoded['Emp'] = df_encoded['Emp'].apply(
    #    lambda x: [i.strip().replace(' ', '_').replace(',', '') for i in x]
    #)
    
    # Get unique categories
    unique_categories = set([item for sublist in df_encoded['Emp'] for item in sublist])
    
    # Create one-hot columns
    for category in unique_categories:
        colname=clean_column_name(category)
        df_encoded[f'Employment_{colname}'] = df_encoded[column].apply(
            lambda x: 1 if category in x else 0
        )
    
    # Display results
    html_content = f"""
    <style>
        .encode-container {{
            margin: 20px;
            font-family: Arial, sans-serif;
        }}
        .result-card {{
            background-color: black;
            border-radius: 8px;
            padding: 15px;
            margin: 10px 0;
            box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        }}
        .category-list {{
            columns: 3;
            list-style-type: none;
            padding: 0;
        }}
    </style>
    
    <div class='encode-container'>
        <div class='result-card'>
            <h3>Advanced One-Hot Encoding Results</h3>
            <p>Total Categories Encoded: <b>{len(unique_categories)}</b></p>
            <h4>Generated Columns:</h4>
            <ul class='category-list'>
                {''.join(f"<li>Employment_{clean_column_name(category)}</li>" for category in sorted(unique_categories))}
            </ul>
        </div>
    </div>
    """
    
    display(HTML(html_content))
    return df_encoded, unique_categories

### 2. Explore the Dataset


<h5>2.1 Summarize the dataset by displaying the column data types, counts, and missing values.</h5>


In [9]:
# Write your code here
df.describe(include=None)

Unnamed: 0,ResponseId,CompTotal,WorkExp,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,...,JobSat,Employedparttime,Notemployedbutlookingforwork,Studentfulltime,Notemployedandnotlookingforwork,Independentcontractorfreelancerorselfemployed,Employedfulltime,Iprefernottosay,Retired,Studentparttime
count,65437.0,33740.0,29658.0,29324.0,29393.0,29411.0,29450.0,29448.0,29456.0,29456.0,...,29126.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0
mean,32719.0,2.963841e+145,11.466957,18.581094,7.52214,10.060857,24.343232,22.96522,20.278165,16.169432,...,6.935041,0.063343,0.060425,0.131821,0.018384,0.163913,0.69016,0.008344,0.010407,0.040589
std,18890.179119,5.444117e+147,9.168709,25.966221,18.422661,21.833836,27.08936,27.01774,26.10811,24.845032,...,2.088259,0.243581,0.238274,0.338299,0.134337,0.3702,0.462431,0.090964,0.101483,0.197337
min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,16360.0,60000.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,32719.0,110000.0,9.0,10.0,0.0,0.0,20.0,15.0,10.0,5.0,...,7.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
75%,49078.0,250000.0,16.0,22.0,5.0,10.0,30.0,30.0,25.0,20.0,...,8.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
max,65437.0,1e+150,50.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,...,10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [10]:
summary = {
    'Column': df.columns,
    'Data Type': df.dtypes,
    'Count': df.count(),
    'Missing': df.isnull().sum(),
    'Missing %': (df.isnull().sum() / len(df) * 100).round(2)
}

summary_df = pd.DataFrame(summary)
summary_df
#summarize_dataset(df)

Unnamed: 0,Column,Data Type,Count,Missing,Missing %
ResponseId,ResponseId,int64,65437,0,0.00
MainBranch,MainBranch,object,65437,0,0.00
Age,Age,object,65437,0,0.00
Employment,Employment,object,65437,0,0.00
RemoteWork,RemoteWork,object,54806,10631,16.25
...,...,...,...,...,...
Independentcontractorfreelancerorselfemployed,Independentcontractorfreelancerorselfemployed,int64,65437,0,0.00
Employedfulltime,Employedfulltime,int64,65437,0,0.00
Iprefernottosay,Iprefernottosay,int64,65437,0,0.00
Retired,Retired,int64,65437,0,0.00


<h5>2.2 Generate basic statistics for numerical columns.</h5>


In [11]:
# Write your code here
# Select numerical columns
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns

# Calculate statistics
stats = df[numerical_cols].agg([
    'count', 'mean', 'median', 'std', 'min',
    lambda x: x.quantile(0.25),
    lambda x: x.quantile(0.75),
    'max'
]).round(2)

stats.index = ['Count', 'Mean', 'Median', 'Std Dev', 'Min', '25th', '75th', 'Max']

# analyze_numerical_columns recomputes the same statistics and renders them as HTML
res=analyze_numerical_columns(df)

Metric,ResponseId,CompTotal,WorkExp,JobSatPoints_1,JobSatPoints_4,JobSatPoints_5,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,ConvertedCompYearly,JobSat,Employedparttime,Notemployedbutlookingforwork,Studentfulltime,Notemployedandnotlookingforwork,Independentcontractorfreelancerorselfemployed,Employedfulltime,Iprefernottosay,Retired,Studentparttime
Count,65437.0,33740.0,29658.0,29324.0,29393.0,29411.0,29450.0,29448.0,29456.0,29456.0,29450.0,29445.0,"$23,435.00",29126.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0,65437.0
Mean,32719.0,2.9638411381149976e+145,11.47,18.58,7.52,10.06,24.34,22.97,20.28,16.17,10.96,9.95,"$86,155.29",6.94,0.06,0.06,0.13,0.02,0.16,0.69,0.01,0.01,0.04
Median,32719.0,110000.0,9.0,10.0,0.0,0.0,20.0,15.0,10.0,5.0,0.0,0.0,"$65,000.00",7.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Std Dev,18890.18,5.444117135142297e+147,9.17,25.97,18.42,21.83,27.09,27.02,26.11,24.85,22.91,21.78,"$186,756.97",2.09,0.24,0.24,0.34,0.13,0.37,0.46,0.09,0.1,0.2
Min,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,$1.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25th,16360.0,60000.0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"$32,712.00",6.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75th,49078.0,250000.0,16.0,22.0,5.0,10.0,30.0,30.0,25.0,20.0,10.0,10.0,"$107,971.50",8.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Max,65437.0,1e+150,50.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,100.0,"$16,256,603.00",10.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


### 3. Identifying and Removing Inconsistencies


<h5>3.1 Identify inconsistent or irrelevant entries in specific columns (e.g., Country).</h5>


In [12]:
# Write your code here
value_counts = df['Country'].value_counts()
rare_values = value_counts[value_counts < len(df) * 0.01]
#print(rare_values)
columns=['Country', 'Employment', 'DevType']
analysis_results=analyze_multiple_columns_consistency(df,columns)
#print(analysis_results)

**Country: top 5 most common**

| Country | count |
|---|---|
| United States of America | 11095 |
| Germany | 4947 |
| India | 4231 |
| United Kingdom of Great Britain and Northern Ireland | 3224 |
| Ukraine | 2672 |

**Country: sample rare values (under 1%)**

| Country | count |
|---|---|
| Israel | 604 |
| Turkey | 546 |
| Belgium | 526 |
| Denmark | 504 |
| Portugal | 470 |

**Employment: top 5 most common**

| Employment | count |
|---|---|
| Employed, full-time | 39041 |
| Independent contractor, freelancer, or self-employed | 4846 |
| Student, full-time | 4709 |
| Employed, full-time;Independent contractor, freelancer, or self-employed | 3557 |
| Not employed, but looking for work | 2341 |

**Employment: sample rare values (under 1%)**

| Employment | count |
|---|---|
| Not employed, and not looking for work | 633 |
| Student, part-time;Employed, part-time | 558 |
| I prefer not to say | 546 |
| Retired | 525 |
| Student, part-time | 494 |

**DevType: top 5 most common**

| DevType | count |
|---|---|
| Developer, full-stack | 18260 |
| Developer, back-end | 9928 |
| Student | 5102 |
| Developer, front-end | 3349 |
| Developer, desktop or enterprise applications | 2493 |

**DevType: sample rare values (under 1%)**

| DevType | count |
|---|---|
| Cloud infrastructure engineer | 634 |
| System administrator | 552 |
| Developer, AI | 543 |
| Developer, QA or test | 525 |
| Data or business analyst | 523 |
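
Rare values are one signal of inconsistency; spelling variants of the same value are another. The sketch below, which uses a made-up toy column rather than the survey data, groups raw spellings under a whitespace- and case-normalized key and reports keys that appear under more than one spelling.

```python
import pandas as pd

# Hypothetical toy column with spelling variants of the same country
countries = pd.Series(["Germany", "germany ", "India", "GERMANY", "India", "Peru"])

# Normalize whitespace and case, then collect the raw spellings seen for each key
normalized = countries.str.strip().str.casefold()
variants = countries.groupby(normalized).unique()

# Keep only keys that were written in more than one way
inconsistent = variants[variants.apply(len) > 1]
print(inconsistent)
```

Any key listed in `inconsistent` is a candidate for the standardization mapping in the next step.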


<h5>3.2 Standardize entries in columns like Country or EdLevel by mapping inconsistent values to a consistent format.</h5>


In [13]:
## Write your code here
mappings = {
    'Country': {
        'United States of America': 'USA',
        'United States': 'USA',
        'US': 'USA',
        'United Kingdom': 'UK',
        'Great Britain': 'UK'
    },
    'EdLevel': {
        "Bachelor's degree": "Bachelor's",
        "Master's degree": "Master's",
        'High School': 'Secondary',
        'Secondary school': 'Secondary'
    }
}

# Execute standardization
standardized_df, mapping_stats = standardize_column_values(df, mappings)

Original Value,Standardized Value
United States of America,USA
United States,USA
US,USA
United Kingdom,UK
Great Britain,UK

Original Value,Standardized Value
Bachelor's degree,Bachelor's
Master's degree,Master's
High School,Secondary
Secondary school,Secondary
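
A mapping only changes values that actually occur in the column, so it can help to check which keys never match. The following is a minimal sketch on a made-up toy Series (not the survey data) that applies the same `map(...).fillna(...)` pattern as `standardize_column_values` and lists the unused keys.

```python
import pandas as pd

# Hypothetical toy data and mapping
s = pd.Series(["United States of America", "Germany", "US", None])
mapping = {
    "United States of America": "USA",
    "US": "USA",
    "United States": "USA",
}

# map() returns NaN for unmapped values; fillna(s) restores the original entries
standardized = s.map(mapping).fillna(s)

# Mapping keys that never occur in the data have no effect
unused_keys = sorted(set(mapping) - set(s.dropna()))

print(standardized.tolist())
print(unused_keys)
```

Genuinely missing entries stay missing, because `fillna(s)` has nothing to restore for them.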


### 4. Encoding Categorical Variables


<h5>4.1 Encode the Employment column using one-hot encoding.</h5>


In [14]:
## Write your code here
# Use raw strings for every pattern so that \s is not treated as a string escape
regex = [r'employed[\s,]*full[-\s]*time', r'employed[\s,]*part[-\s]*time', r'student[\s,]*full[-\s]*time', r'student[\s,]*part[-\s]*time', r'Retired']
cols=['Employment_Employedfulltime','Employment_Employedparttime','Employment_Studentfulltime','Employment_Studentparttime','Employment_Retired']
encoded_df, categories = create_advanced_encoding(df)
for r in regex:
    emps = df[df['Employment'].str.contains(r, na=False, case=False)]
    print(r,emps['Employment'].value_counts(ascending=False).sum())

for c in cols:
    vc = encoded_df[c].value_counts()
    if 1 in vc.index:
        print(f"Total count of 1's in {c}: {vc[1]}")
    else:
        print(f"No 1's found in {c}")


employed[\s,]*full[-\s]*time 45162
employed[\s,]*part[-\s]*time 4145
student[\s,]*full[-\s]*time 8626
student[\s,]*part[-\s]*time 2656
Retired 681
Total count of 1's in Employment_Employedfulltime: 45162
Total count of 1's in Employment_Employedparttime: 4145
Total count of 1's in Employment_Studentfulltime: 8626
Total count of 1's in Employment_Studentparttime: 2656
Total count of 1's in Employment_Retired: 681
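
The encoding helper tests each category with a substring check (`category in x`). As a possible alternative, pandas provides `Series.str.get_dummies`, which splits on a separator and matches whole tokens. The sketch below uses a small made-up Series in the same semicolon-separated format as the `Employment` column.

```python
import pandas as pd

# Hypothetical toy responses in the survey's "a;b" multi-select format
employment = pd.Series([
    "Employed, full-time",
    "Student, part-time;Employed, part-time",
    "Employed, full-time;Student, part-time",
])

# One indicator column per distinct token between separators
dummies = employment.str.get_dummies(sep=";")

print(dummies.sum())
```

Because tokens are compared as whole strings, one category can never be counted inside another category's label.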


### 5. Handling Missing Values


<h5>5.1 Identify columns with the highest number of missing values.</h5>


In [15]:
## Write your code here
missing_values = df.isnull().sum().sort_values(ascending=False)
print(missing_values.head())

AINextMuch less integrated    64289
AINextLess integrated         63082
AINextNo change               52939
AINextMuch more integrated    51999
EmbeddedAdmired               48704
dtype: int64
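
Raw counts are easier to compare across datasets when expressed as a share of rows. This is a small sketch on a made-up DataFrame; the 50% cut-off is an arbitrary assumption for illustration, not a rule from the lab.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with different amounts of missing data
toy = pd.DataFrame({
    "a": [1, np.nan, 3, np.nan],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})

# The mean of a boolean mask is the fraction of missing rows per column
missing_share = toy.isna().mean().sort_values(ascending=False)

# Columns at or above the (assumed) 50% threshold
mostly_missing = missing_share[missing_share >= 0.5].index.tolist()

print(missing_share)
print(mostly_missing)
```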


<h5>5.2 Impute missing values in numerical columns (e.g., `ConvertedCompYearly`) with the mean or median.</h5>


In [16]:
print('Count Before',df['ConvertedCompYearly'].isna().sum())

Count Before 42002


In [17]:
## Write your code here
df['ConvertedCompYearly'] = df['ConvertedCompYearly'].fillna(df['ConvertedCompYearly'].median())
print('Count After',df['ConvertedCompYearly'].isna().sum())

Count After 0
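
The choice between mean and median matters for skewed columns such as compensation. The toy numbers below are invented to show how a single very large value pulls the mean, and therefore every imputed entry, far from the typical case. Since 42,002 of the 65,437 rows (roughly 64%) were missing here, it may also be worth keeping a flag that records which rows were imputed; the `was_missing` name is only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy salaries with one extreme value and one missing entry
comp = pd.Series([40_000, 50_000, 60_000, np.nan, 5_000_000])

# Record which rows are imputed before filling them
was_missing = comp.isna()

mean_filled = comp.fillna(comp.mean())
median_filled = comp.fillna(comp.median())

print(mean_filled.iloc[3])    # 1287500.0
print(median_filled.iloc[3])  # 55000.0
```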


<h5>5.3 Impute missing values in categorical columns (e.g., `RemoteWork`) with the most frequent value.</h5>


In [18]:
print('Count Before',df['RemoteWork'].isna().sum())

Count Before 10631


In [19]:
## Write your code here
most_frequent_value = df['RemoteWork'].mode()[0]
df['RemoteWork'] = df['RemoteWork'].fillna(most_frequent_value)
print('Count After',df['RemoteWork'].isna().sum())

Count After 0
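
`Series.mode()` returns a Series rather than a single value, because several values can tie for most frequent; missing entries are ignored by default. The sketch below uses made-up responses (only `Remote` appears in the lab output; the other labels are assumptions) to show the pattern on a small scale.

```python
import pandas as pd

# Hypothetical toy responses with two missing entries
remote = pd.Series(["Remote", "Hybrid", "Remote", None, "In-person", None])

# mode() skips missing values and returns all tied values, sorted
modes = remote.mode()

# Take the first mode as the fill value
filled = remote.fillna(modes.iloc[0])

print(modes.tolist())
print(filled.value_counts())
```

When `modes` holds more than one value, taking the first is an arbitrary tie-break and is worth documenting.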


### 6. Feature Scaling and Transformation


<h5>6.1 Apply Min-Max Scaling to normalize the `ConvertedCompYearly` column.</h5>


In [20]:
conCompY=df.copy()
column=['ConvertedCompYearly']
print(f"'min': {conCompY[column].min()} \n'max': {conCompY[column].max()}")

'min': ConvertedCompYearly    1.0
dtype: float64 
'max': ConvertedCompYearly    16256603.0
dtype: float64


In [21]:
## Write your code here
column=['ConvertedCompYearly']
min_val = conCompY[column].min()
max_val = conCompY[column].max()
conCompY[column] = (conCompY[column] - min_val) / (max_val - min_val)
print(f"'min': {conCompY[column].min()} \n'max': {conCompY[column].max()}")

'min': ConvertedCompYearly    0.0
dtype: float64 
'max': ConvertedCompYearly    1.0
dtype: float64
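
Min-Max scaling maps each value to `(x - min) / (max - min)`, so the result always spans 0 to 1, but one extreme value can squeeze everything else toward 0. A minimal sketch with invented numbers follows; it selects a Series rather than a one-column DataFrame, so `min()` and `max()` return plain scalars.

```python
import pandas as pd

# Hypothetical toy values with one large outlier
values = pd.Series([10.0, 20.0, 30.0, 1000.0])

# Min-Max scaling: (x - min) / (max - min)
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled.round(4).tolist())
```

With a maximum as large as the one in `ConvertedCompYearly`, most scaled values end up very close to 0, which is one motivation for the log transform in the next step.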


<h5>6.2 Log-transform the ConvertedCompYearly column to reduce skewness.</h5>


In [23]:
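import numpy as np  # imported in this cell because np.nan is used below
conCompY=df.copy()
column=['ConvertedCompYearly']
# Mask non-positive values, for which the logarithm is undefined
conCompY['ConvertedCompYearly'] = conCompY['ConvertedCompYearly'].where(conCompY['ConvertedCompYearly'] > 0)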
print(conCompY[column].isnull().sum())
print(f"'min': {conCompY[column].min()} \n'max': {conCompY[column].max()}")

ConvertedCompYearly    0
dtype: int64
'min': ConvertedCompYearly    1.0
dtype: float64 
'max': ConvertedCompYearly    16256603.0
dtype: float64


In [25]:
## Write your code here
import numpy as np
conCompY['ConvertedCompYearly-log'] = np.log(conCompY['ConvertedCompYearly'])
print(f"'min': {conCompY[column].min()} \n'max': {conCompY[column].max()}")
print(f"'min': {conCompY['ConvertedCompYearly-log'].min()} \n'max': {conCompY['ConvertedCompYearly-log'].max()}")

'min': ConvertedCompYearly    1.0
dtype: float64 
'max': ConvertedCompYearly    16256603.0
dtype: float64
'min': 0.0 
'max': 16.604009722668444
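
Comparing minimum and maximum shows the new range, but `Series.skew()` measures the skewness itself. The sketch below uses synthetic log-normal data as a stand-in for compensation (an assumption, not the survey values) to compare skewness before and after `np.log`.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for compensation (assumption)
rng = np.random.default_rng(0)
comp = pd.Series(rng.lognormal(mean=11, sigma=1, size=1000))

# All values are positive here, so np.log is defined everywhere
log_comp = np.log(comp)

print(f"skew before: {comp.skew():.2f}")
print(f"skew after:  {log_comp.skew():.2f}")
```

If a column can contain zeros, `np.log1p`, which computes `log(1 + x)`, is a common substitute.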


### 7. Feature Engineering


<h5>7.1 Create a new column `ExperienceLevel` based on the `YearsCodePro` column:</h5>


In [71]:
def classify_experience_level(years):
    """Map a YearsCodePro value to an experience-level label."""
    # Missing responses get their own label
    if pd.isna(years):
        return 'Null-Unknown'
    # Handle the two non-numeric survey categories
    if isinstance(years, str):
        if 'Less than 1 year' in years:
            return 'Beginner'
        if 'More than 50 years' in years:
            return 'Advanced'
    try:
        years = float(years)
    except (TypeError, ValueError):
        return 'Not Found'  # Unexpected, non-numeric value
    # Contiguous ranges, so fractional values such as 2.5 are also classified
    if years <= 2:
        return 'Beginner'
    elif years <= 5:
        return 'Intermediate'
    elif years <= 10:
        return 'Experienced'
    return 'Advanced'

def convert_years_code(value):
    """Convert a YearsCodePro value to a number (-1.0 marks missing data)."""
    if isinstance(value, str):
        if "Less than 1 year" in value:
            return 0  # "Less than 1 year" becomes 0
        elif "More than 50 years" in value:
            return 51  # Placeholder just above 50 for "More than 50 years"
        else:
            try:
                return float(value)  # Try converting to float for valid numeric strings
            except ValueError:
                return None  # Return None if the value can't be converted
    elif pd.isna(value):
        return -1.0  # Sentinel for missing data; exclude it before computing statistics
    else:
        return value  # Return the value if it's already a number


In [73]:
## Write your code here
df['ExperienceLevel'] = df['YearsCodePro'].apply(classify_experience_level)
df['YearsCodePro_int']=df['YearsCodePro'].apply(convert_years_code)
#print(df['YearsCodePro_int'])
#print(df[['YearsCodePro', 'ExperienceLevel']].value_counts(ascending=False))
print(df['ExperienceLevel'].value_counts(ascending=False),df['YearsCodePro_int'].value_counts(ascending=True).sort_index())

ExperienceLevel
Advanced        18460
Null-Unknown    13827
Experienced     12653
Intermediate    10834
Beginner         9663
Name: count, dtype: int64 YearsCodePro_int
-1.0     13827
 0.0      2856
 1.0      2639
 2.0      4168
 3.0      4093
 4.0      3215
 5.0      3526
 6.0      2843
 7.0      2517
 8.0      2549
 9.0      1493
 10.0     3251
 11.0     1312
 12.0     1777
 13.0     1127
 14.0     1082
 15.0     1635
 16.0      946
 17.0      814
 18.0      867
 19.0      516
 20.0     1549
 21.0      380
 22.0      492
 23.0      448
 24.0      632
 25.0      998
 26.0      426
 27.0      380
 28.0      342
 29.0      196
 30.0      689
 31.0      106
 32.0      194
 33.0      132
 34.0      169
 35.0      285
 36.0      119
 37.0      104
 38.0      134
 39.0       54
 40.0      194
 41.0       51
 42.0       55
 43.0       37
 44.0       42
 45.0       56
 46.0       21
 47.0       10
 48.0       14
 49.0       11
 50.0       14
 51.0       50
Name: count, dtype: int64
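
Once the years are numeric, the same banding can be expressed with `pd.cut` instead of a hand-written chain of comparisons. This is a sketch under the assumption that the bands are 0 to 2, 3 to 5, 6 to 10, and above 10, as in `classify_experience_level`; the toy values are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric years, including a missing entry
years = pd.Series([0, 2, 3, 5, 6, 10, 11, 51, np.nan])

# Bin edges; intervals are right-inclusive by default: (-inf, 2], (2, 5], (5, 10], (10, inf)
bins = [-np.inf, 2, 5, 10, np.inf]
labels = ["Beginner", "Intermediate", "Experienced", "Advanced"]

levels = pd.cut(years, bins=bins, labels=labels)

print(levels.value_counts(dropna=False))
```

Missing inputs stay missing in the result, so no sentinel such as `-1.0` is needed for this step.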


### Summary

In this lab, you:

- Explored the dataset to identify inconsistencies and missing values.

- Encoded categorical variables for analysis.

- Handled missing values using imputation techniques.

- Normalized and transformed numerical data to prepare it for analysis.

- Engineered a new feature to enhance data interpretation.


Copyright © IBM Corporation. All rights reserved.
