# Phase 1 — Software Engineer Salaries (Data Import, Cleaning, Transformation)

This notebook reproduces the steps used to import, clean, and transform the uploaded raw dataset `Software Engineer Salaries.csv`.

Files produced:
- `Cleaned_Software_Engineer_Salaries.csv`
- `Phase1_documentation.md`


In [5]:
# Step 1: Import Libraries
import pandas as pd   # Pandas library helps us handle CSV/Excel data easily

# Step 2: Load Dataset
# If your dataset is in the same folder as the notebook:
df = pd.read_csv('Software Engineer Salaries.csv')

# Step 3: Check basic info
print("Rows & Columns:", df.shape)  # tells you how many rows and columns are in the data
df.head()  # shows first 5 rows of the data


Rows & Columns: (870, 6)


| | Company | Company Score | Job Title | Location | Date | Salary |
|---|---|---|---|---|---|---|
| 0 | ViewSoft | 4.8 | Software Engineer | Manassas, VA | 8d | $68K - $94K (Glassdoor est.) |
| 1 | Workiva | 4.3 | Software Support Engineer | Remote | 2d | $61K - $104K (Employer est.) |
| 2 | Garmin International, Inc. | 3.9 | C# Software Engineer | Cary, NC | 2d | $95K - $118K (Glassdoor est.) |
| 3 | Snapchat | 3.5 | Software Engineer, Fullstack, 1+ Years of Expe... | Los Angeles, CA | 2d | $97K - $145K (Employer est.) |
| 4 | Vitesco Technologies Group AG | 3.1 | Software Engineer | Seguin, TX | 2d | $85K - $108K (Glassdoor est.) |


In [6]:
# Step 4: Check for missing (empty) values
df.isnull().sum()


Company            2
Company Score     81
Job Title          0
Location          13
Date               0
Salary           106
dtype: int64
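
To put these counts in proportion, the short sketch below converts them into percentages of the 870 rows reported earlier. The numbers are copied from the output above; the variable names are illustrative only.

```python
# Missing-value counts copied from the isnull().sum() output above.
total_rows = 870
missing = {
    "Company": 2,
    "Company Score": 81,
    "Job Title": 0,
    "Location": 13,
    "Date": 0,
    "Salary": 106,
}

# Share of rows with a missing value, per column, in percent.
missing_pct = {col: round(100 * n / total_rows, 1) for col, n in missing.items()}
print(missing_pct)
```

On the DataFrame itself, `df.isnull().mean() * 100` should give the same percentages without hard-coding the counts.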

## Cleaning steps applied

1. Drop fully empty columns.
2. Drop duplicate rows.
3. Parse salary-like columns into numeric `salary_parsed_raw` (handle ranges and 'K' suffix).
4. Standardize job title and location text fields.
5. Create numeric years of experience where available and bucket into bands.
6. Flag extreme outliers.
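
Step 3 could be sketched roughly as follows. This is a minimal illustration, not necessarily the implementation used for this phase: the helper name `parse_salary_range`, the regular expression, and the choice of the range midpoint as the single numeric value are all assumptions, based on strings such as `$68K - $94K (Glassdoor est.)` in the data preview. Values on another basis (for example hourly rates), if present, would need separate handling.

```python
import re

# Hypothetical helper (an assumption, not taken from the notebook): converts a
# salary string such as "$68K - $94K (Glassdoor est.)" into dollar amounts.
_AMOUNT = re.compile(r"\$\s*(\d+(?:\.\d+)?)\s*([Kk])?")


def parse_salary_range(text):
    """Return (low, high, midpoint) in dollars, or None if no amount is found."""
    if not isinstance(text, str):
        return None  # missing values (NaN/None) stay missing
    # Normalize non-breaking spaces and drop thousands separators.
    cleaned = text.replace("\xa0", " ").replace(",", "")
    amounts = []
    for number, suffix in _AMOUNT.findall(cleaned):
        value = float(number)
        if suffix:  # a 'K' suffix means thousands
            value *= 1000
        amounts.append(value)
    if not amounts:
        return None
    low, high = min(amounts), max(amounts)
    return low, high, (low + high) / 2


print(parse_salary_range("$68K - $94K\xa0(Glassdoor est.)"))
# (68000.0, 94000.0, 81000.0)
```

One possible way to apply it to the raw column is `df['Salary'].map(parse_salary_range)`, then storing the third element of each non-missing result in `salary_parsed_raw`.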


In [7]:
# Step 5: Remove duplicate rows (rows that exactly repeat an earlier row)
print("Before removing duplicates:", df.shape)
df = df.drop_duplicates()
print("After removing duplicates:", df.shape)


Before removing duplicates: (870, 6)
After removing duplicates: (870, 6)


In [9]:
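# Step 7: Remove $ and commas from salary column
# Note: the values are ranges such as "$68K - $94K (Glassdoor est.)", so they are
# still non-numeric after this step and errors='coerce' turns such values into NaN.
df['Salary'] = df['Salary'].replace(r'[\$,]', '', regex=True)
df['Salary'] = pd.to_numeric(df['Salary'], errors='coerce')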

# Check cleaned salary
df['Salary'].head()


0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
Name: Salary, dtype: float64
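
The NaN values follow from the shape of the raw strings: removing `$` and `,` still leaves text such as `68K - 94K (Glassdoor est.)`, which is not a number, and `errors='coerce'` converts anything unparseable to `NaN`. The plain-Python sketch below mimics that behaviour for a single value. `strip_and_convert` is a hypothetical helper for illustration, with `float()` standing in for the numeric conversion.

```python
import math
import re


def strip_and_convert(text):
    """Mimic the cell above for one value: strip '$' and ',', then convert."""
    stripped = re.sub(r"[\$,]", "", text)
    try:
        return stripped, float(stripped)
    except ValueError:
        # Stand-in for errors='coerce': unparseable text becomes NaN.
        return stripped, math.nan


print(strip_and_convert("$68K - $94K (Glassdoor est.)"))  # text remains, value is nan
print(strip_and_convert("$68,000"))                       # a plain amount converts fine
```

Because the cell overwrites `Salary` in place, the original strings are no longer available in `df` afterwards, so range parsing needs a freshly loaded copy of the column.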

In [10]:
# Step 8: Fill missing text fields with placeholder labels
df['Location'] = df['Location'].fillna('Unknown')
df['Job Title'] = df['Job Title'].fillna('Not Mentioned')


In [13]:
df['Salary'].head(10)


0   NaN
1   NaN
2   NaN
3   NaN
4   NaN
5   NaN
6   NaN
7   NaN
8   NaN
9   NaN
Name: Salary, dtype: float64

In [14]:
import pandas as pd   # Pandas library helps us handle CSV/Excel data easily

# Step 2 (repeated): Reload the dataset to restore the raw Salary strings overwritten in Step 7
# If your dataset is in the same folder as the notebook:
df = pd.read_csv('Software Engineer Salaries.csv')

Columns: ['Company', 'Company Score', 'Job Title', 'Location', 'Date', 'Salary']

Sample values (first 30) in 'Salary' column:
['$68K - $94K\xa0(Glassdoor est.)', '$61K - $104K\xa0(Employer est.)', '$95K - $118K\xa0(Glassdoor est.)', '$97K - $145K\xa0(Employer est.)', '$85K - $108K\xa0(Glassdoor est.)', '$123K - $175K\xa0(Employer est.)', '$77K - $94K\xa0(Glassdoor est.)', '$71K - $100K\xa0(Glassdoor est.)', '$94K - $148K\xa0(Glassdoor est.)', '$147K - $189K\xa0(Employer est.)', '$90K - $113K\xa0(Employer est.)', '$54K - $79K\xa0(Glassdoor est.)', '$91K - $135K\xa0(Glassdoor est.)', '$70K - $135K\xa0(Employer est.)', '$192K - $288K\xa0(Employer est.)', '$100K - $125K\xa0(Employer est.)', '$85K - $127K\xa0(Glassdoor est.)', '$69K - $109K\xa0(Glassdoor est.)', '$124K - $234K\xa0(Employer est.)', '$84K - $133K\xa0(Glassdoor est.)', '$85K - $150K\xa0(Employer est.)', '$48K - $89K\xa0(Glassdoor est.)', '$66K - $97K\xa0(Glassdoor est.)', '$108K - $199K\xa0(Employer est.)', '$102K - $173K\xa0