<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo"  />
    </a>
</p>


# **Impute Missing Values**


Estimated time needed: **30** minutes


In this lab, you will practice essential data wrangling techniques using the Stack Overflow survey dataset. The primary focus is on handling missing data and ensuring data quality. You will:

- **Load the Data:** Import the dataset into a DataFrame using the pandas library.

- **Clean the Data:** Identify and remove duplicate entries to maintain data integrity.

- **Handle Missing Values:** Detect missing values, impute them with appropriate strategies, and verify the imputation to create a complete and reliable dataset for analysis.

This lab equips you with the skills to effectively preprocess and clean real-world datasets, a crucial step in any data analysis project.



## Objectives


In this lab, you will perform the following:


-   Identify missing values in the dataset.

-   Apply techniques to impute missing values in the dataset.
  
-   Use suitable techniques to normalize data in the dataset.




-----


#### Install needed library


In [None]:
# !pip install pandas

### Step 1: Import Required Libraries


In [1]:
import pandas as pd

### Step 2: Load the Dataset Into a Dataframe


#### **Read Data**
<p>
The code below reads the survey dataset from a URL directly into a pandas DataFrame:
</p>


In [2]:
file_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/n01PQ9pSmiRX6520flujwQ/survey-data.csv"
df = pd.read_csv(file_path)

# Display the first few rows to ensure it loaded correctly
df.head()

Unnamed: 0,ResponseId,MainBranch,Age,Employment,RemoteWork,Check,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,...,JobSatPoints_6,JobSatPoints_7,JobSatPoints_8,JobSatPoints_9,JobSatPoints_10,JobSatPoints_11,SurveyLength,SurveyEase,ConvertedCompYearly,JobSat
0,1,I am a developer by profession,Under 18 years old,"Employed, full-time",Remote,Apples,Hobby,Primary/elementary school,Books / Physical media,,...,,,,,,,,,,
1,2,I am a developer by profession,35-44 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Bachelor’s degree (B.A., B.S., B.Eng., etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,0.0,0.0,0.0,0.0,0.0,0.0,,,,
2,3,I am a developer by profession,45-54 years old,"Employed, full-time",Remote,Apples,Hobby;Contribute to open-source projects;Other...,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Colleague;On the job tr...,Technical documentation;Blogs;Books;Written Tu...,...,,,,,,,Appropriate in length,Easy,,
3,4,I am learning to code,18-24 years old,"Student, full-time",,Apples,,Some college/university study without earning ...,"Other online resources (e.g., videos, blogs, f...",Stack Overflow;How-to videos;Interactive tutorial,...,,,,,,,Too long,Easy,,
4,5,I am a developer by profession,18-24 years old,"Student, full-time",,Apples,,"Secondary school (e.g. American high school, G...","Other online resources (e.g., videos, blogs, f...",Technical documentation;Blogs;Written Tutorial...,...,,,,,,,Too short,Easy,,


### Step 3: Finding and Removing Duplicates
##### Task 1: Identify duplicate rows in the dataset.




In [10]:
## Write your code here
# Task 1: Identify duplicate rows
duplicate_rows = df.duplicated()
print("Number of duplicate rows:", duplicate_rows.sum())


Number of duplicate rows: 0
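
When the count is not zero, it often helps to look at the duplicated rows themselves before dropping anything. The following is a minimal sketch on a small, hypothetical DataFrame (the `toy` data is made up and only borrows column names from the survey). It shows how `keep=False` flags every member of a duplicate group, whereas the default flags only the repeats.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the survey, values are made up.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "RemoteWork": ["Remote", "Hybrid", "Hybrid", "In-person"],
})

# keep=False marks every row that belongs to a duplicate group.
all_duplicates = toy[toy.duplicated(keep=False)]
print(all_duplicates)

# The default (keep="first") flags only the second and later occurrences.
n_repeats = toy.duplicated().sum()
print("Repeated rows:", n_repeats)
```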


##### Task 2: Remove the duplicate rows from the dataframe.



In [11]:
## Write your code here
#  Task 2: Remove duplicate rows
df = df.drop_duplicates()
print("Shape of dataframe after removing duplicates:", df.shape)

Shape of dataframe after removing duplicates: (65437, 114)
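
`drop_duplicates()` compares whole rows by default. If two rows should count as duplicates whenever a key column matches, the `subset` parameter can restrict the comparison. The sketch below assumes a hypothetical `toy` DataFrame in which one `ResponseId` appears twice with different answers; whether `ResponseId` is the right key for the real dataset is an assumption you would need to check.

```python
import pandas as pd

# Hypothetical toy data: ResponseId 2 appears twice with different answers.
toy = pd.DataFrame({
    "ResponseId": [1, 2, 2, 3],
    "RemoteWork": ["Remote", "Hybrid", "Remote", "In-person"],
})

# Full-row comparison finds nothing, because the two ResponseId 2 rows differ.
print("Full-row duplicates:", toy.duplicated().sum())

# Compare on ResponseId only and keep the first occurrence of each key.
deduped = toy.drop_duplicates(subset=["ResponseId"], keep="first").reset_index(drop=True)
print(deduped)
```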


### Step 4: Finding Missing Values
##### Task 3: Find the missing values for all columns.


In [12]:
## Write your code here

missing_values = df.isnull().sum()
print("Missing values per column:")
print(missing_values)

Missing values per column:
ResponseId                 0
MainBranch                 0
Age                        0
Employment                 0
RemoteWork             10631
                       ...  
JobSatPoints_11        35992
SurveyLength            9255
SurveyEase              9199
ConvertedCompYearly    42002
JobSat                 36311
Length: 114, dtype: int64
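
Raw counts are hard to compare across 114 columns, so a percentage view sorted from most to least missing can be easier to read. This is a minimal sketch on a hypothetical `toy` DataFrame (values are made up); the same two lines should apply to `df` unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in two of three columns.
toy = pd.DataFrame({
    "Age": ["18-24", "25-34", "35-44", "45-54"],
    "RemoteWork": ["Remote", np.nan, "Hybrid", np.nan],
    "JobSat": [np.nan, np.nan, np.nan, 7.0],
})

# isnull().mean() is the fraction of missing entries per column.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

# Keep only the columns that actually contain missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```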


##### Task 4: Find out how many rows have a missing value in the RemoteWork column.



In [13]:
## Write your code here

# Task 4: Find missing rows in RemoteWork column
remote_work_missing = df['RemoteWork'].isnull().sum()
print("Number of missing values in RemoteWork column:", remote_work_missing)

Number of missing values in RemoteWork column: 10631


### Step 5: Imputing Missing Values
##### Task 5: Find the value counts for the column RemoteWork.




In [14]:
## Write your code here

# Task 5: Find value counts for RemoteWork column
remote_work_counts = df['RemoteWork'].value_counts()
print("Value counts for RemoteWork:")
print(remote_work_counts)

Value counts for RemoteWork:
RemoteWork
Hybrid (some remote, some in-person)    23015
Remote                                  20831
In-person                               10960
Name: count, dtype: int64
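
By default `value_counts()` leaves missing values out, which is why the three counts above do not add up to the number of rows. The sketch below, on a hypothetical `toy` Series with made-up values, shows the `dropna=False` and `normalize=True` options, which put the missing share next to the real categories.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with two missing entries.
toy = pd.Series(
    ["Remote", "Hybrid", "Hybrid", np.nan, "In-person", np.nan, "Hybrid"],
    name="RemoteWork",
)

# dropna=False adds a NaN row so missing entries are counted too.
counts_with_nan = toy.value_counts(dropna=False)
print(counts_with_nan)

# normalize=True returns proportions instead of raw counts.
shares = toy.value_counts(dropna=False, normalize=True)
print(shares.round(3))
```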


##### Task 6: Identify the most frequent (majority) value in the RemoteWork column.





In [15]:
## Write your code here

# Task 6: Identify the most frequent value
majority_value = df['RemoteWork'].mode()[0]
print("Most frequent value in RemoteWork:", majority_value)

Most frequent value in RemoteWork: Hybrid (some remote, some in-person)


##### Task 7: Impute (replace) all missing values in the RemoteWork column with the majority value.




In [16]:
## Write your code here

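# Task 7: Impute missing values with majority value
# Assign the result back; calling fillna(..., inplace=True) on a column selection
# triggers a chained-assignment warning in recent pandas and may not update df.
df['RemoteWork'] = df['RemoteWork'].fillna(majority_value)
print("Number of missing values after imputation:", df['RemoteWork'].isnull().sum())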

Number of missing values after imputation: 0
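
The mode-based imputation can be checked end to end on a few rows. The following is a minimal sketch on a hypothetical `toy` DataFrame (the values are made up); it fills the gaps by assigning the filled column back and then confirms that nothing is left missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: "Hybrid" is the most frequent non-missing value.
toy = pd.DataFrame({
    "RemoteWork": ["Hybrid", np.nan, "Remote", "Hybrid", np.nan],
})

# mode() skips NaN by default and returns a Series; take its first entry.
most_frequent = toy["RemoteWork"].mode()[0]

# Assign the filled column back to the DataFrame.
toy["RemoteWork"] = toy["RemoteWork"].fillna(most_frequent)

print(toy["RemoteWork"].value_counts())
print("Remaining missing:", toy["RemoteWork"].isnull().sum())
```

Note that filling every gap with the mode inflates the share of that category, so it is worth comparing the value counts before and after.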


##### Task 8: Check for any compensation-related columns and describe their distribution.




In [21]:
## Write your code here

# Task 8: Check compensation-related columns
# Note: substring matching on 'comp' also picks up unrelated names such as AIComplex.
keywords = ('comp', 'salary', 'income')
compensation_columns = [col for col in df.columns if any(k in col.lower() for k in keywords)]
print("Compensation-related columns:", compensation_columns)

# Describe the distribution of compensation columns
for col in compensation_columns:
    print(f"\nDistribution of {col}:")
    print(df[col].describe())


Compensation-related columns: ['CompTotal', 'AIComplex', 'ConvertedCompYearly']

Distribution of CompTotal:
count     3.374000e+04
mean     2.963841e+145
std      5.444117e+147
min       0.000000e+00
25%       6.000000e+04
50%       1.100000e+05
75%       2.500000e+05
max      1.000000e+150
Name: CompTotal, dtype: float64

Distribution of AIComplex:
count                                             37021
unique                                                5
top       Good, but not great at handling complex tasks
freq                                              12102
Name: AIComplex, dtype: object

Distribution of ConvertedCompYearly:
count    2.343500e+04
mean     8.615529e+04
std      1.867570e+05
min      1.000000e+00
25%      3.271200e+04
50%      6.500000e+04
75%      1.079715e+05
max      1.625660e+07
Name: ConvertedCompYearly, dtype: float64
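
The `CompTotal` summary above shows a mean near 2.96e145 against a median of 1.1e5, so a few extreme entries dominate the mean. For skewed numeric columns like these, the median is usually a more robust fill value than the mean. The sketch below illustrates median imputation on a hypothetical `toy` DataFrame with made-up values. Whether compensation should be imputed at all depends on the analysis, so treat this as one option, not as the lab's prescribed method.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one extreme value and two missing entries.
toy = pd.DataFrame({
    "ConvertedCompYearly": [30000.0, 65000.0, np.nan, 110000.0, 1.0e9, np.nan],
})

col = "ConvertedCompYearly"

# The outlier drags the mean far above typical values; the median barely moves.
mean_value = toy[col].mean()
median_value = toy[col].median()
print("Mean:", mean_value, "Median:", median_value)

# Fill the gaps with the median and assign the column back.
toy[col] = toy[col].fillna(median_value)
print(toy)
```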


### Summary 


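**In this lab, you focused on imputing missing values in the dataset.**

- Use the <code>pandas.read_csv()</code> function to load a dataset from a CSV file into a DataFrame.

- Download the dataset if it's not available online, and specify the correct file path.

- Use <code>duplicated()</code> and <code>drop_duplicates()</code> to identify and remove duplicate rows.

- Use <code>isnull().sum()</code> to count missing values in each column.

- Use <code>mode()</code> together with <code>fillna()</code> to replace missing values in a categorical column with its most frequent value.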



<!--
## Change Log
|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2024-11-05|1.3|Madhusudhan Moole|Updated lab|
|2024-10-29|1.2|Madhusudhan Moole|Updated lab|
|2024-09-27|1.1|Madhusudhan Moole|Updated lab|
|2024-09-26|1.0|Raghul Ramesh|Created lab|
-->


Copyright © IBM Corporation. All rights reserved.
