# Import Libraries
First, import libraries such as NumPy and pandas for data cleaning and manipulation.

In [None]:
import numpy as np
import pandas as pd

# Load Dataset
For this project, we use a small dataset containing information on people applying for jobs. Load the dataset with the pandas function pd.read_csv.

In [None]:
# Raw string (r'...') so the backslashes in the Windows path are not read as escape sequences
Dataset = pd.read_csv(r'OneDrive\Desktop\Week 3 - Data Cleaning (Pandas) - 1\Data-cleaning-for-beginners-using-pandas.csv')
print(Dataset.head())

# Handling Missing Values
After loading the dataset, we check for missing values and deal with them. With a large dataset, rows containing missing or null values are often simply removed, but this dataset is quite small, so instead of removing them we handle them another way.

In [None]:
Dataset.info()

In [None]:
Dataset.isnull().sum()


As there are 7 missing values in the Age column, we can impute them with the mean or median. First, we convert the Age column to a numeric format, and then we calculate its median.

In [None]:
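# Convert Age to numbers; values that cannot be parsed become NaN
Dataset['Age'] = pd.to_numeric(Dataset['Age'], errors='coerce')
# median() ignores NaN by default
median_age = Dataset['Age'].median()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
Dataset['Age'] = Dataset['Age'].fillna(median_age)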
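
The same steps can be tried on a tiny example first. This is only a sketch: the `ages` values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up ages; 'unknown' and None stand in for unparseable and missing entries
ages = pd.Series(['25', '40', 'unknown', None, '31'])

# Non-numeric strings become NaN instead of raising an error
numeric_ages = pd.to_numeric(ages, errors='coerce')

# median() skips NaN, so it is computed from 25, 31 and 40 only
median_age = numeric_ages.median()

# Fill every NaN with the median
filled = numeric_ages.fillna(median_age)
print(filled.tolist())
```

The median is often preferred over the mean here because a single extreme age does not shift it much.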



In [None]:
print(Dataset)

# Data Cleaning / Transformation
As you can see, the Salary column has a very inconsistent format, so we need to standardize it. The current format includes dollar signs ('$'), 'k' for thousands, and ranges written as "low-high". We replace each range with the average of its bounds so that the analysis is consistent.

In [None]:
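# Remove dollar signs and expand the 'k' suffix to thousands
Dataset['Salary'] = Dataset['Salary'].replace({r'\$': '', 'k': '000'}, regex=True)

def calculate_average_salary(salary):
    # Replace a 'low-high' range with the average of its two bounds
    if isinstance(salary, str) and '-' in salary:
        lower, upper = map(float, salary.split('-'))
        return (lower + upper) / 2
    return salary

# Apply the conversion, then make the whole column numeric
Dataset['Salary'] = pd.to_numeric(Dataset['Salary'].apply(calculate_average_salary), errors='coerce')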
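
To see the salary conversion in isolation, here is a minimal sketch on made-up salary strings. The values and the helper name `range_midpoint` are hypothetical, chosen only to mirror the formats described above.

```python
import pandas as pd

# Made-up salary strings in the formats described above
raw = pd.Series(['$44k-$99k', '$55k', '$120k-$140k'])

# Remove '$' and expand the 'k' suffix: '$44k-$99k' -> '44000-99000'
cleaned = raw.str.replace('$', '', regex=False).str.replace('k', '000', regex=False)

def range_midpoint(value):
    """Return the average of a 'low-high' string, or the single value as a float."""
    parts = [float(p) for p in value.split('-')]
    return sum(parts) / len(parts)

salaries = cleaned.apply(range_midpoint)
print(salaries.tolist())
```

Returning a float in every case keeps the column numeric, which later steps such as the z-score calculation rely on.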

In [None]:
print(Dataset)

If we look at the Location column, the location names are inconsistent: some entries have a space between the name and the abbreviation, and others have a comma. This causes problems during analysis, so it is better to remove the whitespace, commas, and abbreviations so that all values are consistent.

In [None]:
# Remove commas and surrounding whitespace
Dataset['Location'] = Dataset['Location'].str.replace(',', '').str.strip()
# Strip the trailing abbreviations, then any whitespace left behind
abbreviations_to_remove = ['In', 'Ny', 'Aus']
for abbreviation in abbreviations_to_remove:
    Dataset['Location'] = Dataset['Location'].apply(lambda x: x[:-len(abbreviation)] if isinstance(x, str) and x.endswith(abbreviation) else x).str.strip()
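
As an alternative sketch, the same cleanup can be written as a single regular expression. The location strings below are made up, and the pattern assumes the abbreviation always follows a comma or a space; it is an illustration, not the method used above.

```python
import pandas as pd

# Made-up location strings showing the inconsistencies described above
locations = pd.Series(['India,In', 'New York,Ny', 'Australia Aus', ' India, In '])

# Drop a trailing 'In', 'Ny' or 'Aus' together with the comma/space before it
pattern = r'[,\s]+(?:In|Ny|Aus)$'
cleaned = locations.str.strip().str.replace(pattern, '', regex=True).str.strip()
print(cleaned.tolist())
```

Requiring a separator before the abbreviation means a name that merely ends in those letters is left unchanged.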

The Rating column holds each company's rating out of 10, but some entries have the value '-1', which does not make sense because no rating can be below 0. We can assume that '-1' means the rating for that company is not available. Dropping these entries would affect the analysis because the dataset is small, so we replace them with the mean or median of the column instead.

In [None]:
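# Treat -1 as missing first so the placeholder does not pull the median down
Dataset['Rating'] = Dataset['Rating'].replace(-1, np.nan)
median_rating = Dataset['Rating'].median()
# Fill both the former -1 entries and any other missing ratings
Dataset['Rating'] = Dataset['Rating'].fillna(median_rating)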
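
The order of operations matters when a placeholder such as -1 is present: if the median is computed while the placeholders are still in the column, they drag it down. A small sketch with made-up ratings shows the difference.

```python
import numpy as np
import pandas as pd

# Made-up ratings; -1 marks "rating not available"
ratings = pd.Series([7.8, -1, 6.7, 5.4, -1, 7.0])

# Median computed with the placeholder still included
median_with_placeholder = ratings.median()

# Median computed after treating -1 as missing
median_valid = ratings.replace(-1, np.nan).median()

print(median_with_placeholder, median_valid)
```

The second median is based only on real ratings, so it is the safer value to impute with.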

The Established column also has some entries with the value '-1', but we cannot replace them with the mean or median because the values are years. We can assume that the establishment year of these companies is not available and that '-1' is used as a placeholder, so we replace these values with 'Not available'.

In [None]:
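# Assign the result back instead of calling replace(inplace=True) on a column selection
Dataset['Established'] = Dataset['Established'].replace(-1, 'Not available')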

The Easy Apply column is boolean, so all of its values should be True or False. We can assume that entries with the value '-1' are False and simply replace them with False.

In [None]:
# Exact-value replacement; none of these entries are regex patterns, so regex matching is not needed
Dataset['Easy Apply'] = Dataset['Easy Apply'].replace([-1, '-1', 'Not Available', 'Not Applicable', 'False'], False)
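
If the column mixes real booleans with strings, another option is to normalise everything to text and compare against 'true'. This is a sketch on made-up values; how the actual column is stored after loading is an assumption.

```python
import pandas as pd

# Made-up column mixing booleans, strings and the -1 placeholder
easy_apply = pd.Series([True, '-1', 'TRUE', -1, 'False', False])

# Anything that is not clearly "true" (including -1) is treated as False
is_true = easy_apply.astype(str).str.strip().str.lower() == 'true'
print(is_true.tolist())
```

The result is a proper boolean column, which is easier to filter and count than a mix of types.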

# Handling Outliers
Now we check for outliers in columns such as Age, Rating, and Salary, because extremely low or high values might affect our analysis. We need to make sure that all values lie within a reasonable range.

In [None]:
import seaborn as sns
sns.boxplot(x=Dataset['Age'])
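
By default, a box plot draws points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles as individual outliers. The same rule can be applied numerically; the sketch below uses made-up ages rather than the dataset.

```python
import pandas as pd

# Made-up ages with one extreme value
ages = pd.Series([22, 25, 28, 30, 31, 35, 38, 40, 95])

# Quartiles and interquartile range
q1 = ages.quantile(0.25)
q3 = ages.quantile(0.75)
iqr = q3 - q1

# Values outside these bounds are the points a box plot shows as outliers
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr
outliers = ages[(ages < lower_bound) | (ages > upper_bound)]
print(outliers.tolist())
```

Having the bounds as numbers makes it possible to filter or cap the values instead of only looking at the plot.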

We can find outliers in the 'Salary' column by calculating the z-score. A value of False means the entry is not an outlier; True means it is an outlier.

In [None]:
from scipy.stats import zscore
z_scores = zscore(Dataset['Salary'])
outliers = (z_scores > 3) | (z_scores < -3)
print(outliers)
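
For reference, the z-score is the distance of a value from the mean, measured in standard deviations. The sketch below computes it by hand with NumPy on made-up salaries, using the population standard deviation (`ddof=0`), which is also the default in `scipy.stats.zscore`.

```python
import numpy as np

# Made-up salaries with one very large value
salaries = np.array([50000, 52000, 48000, 51000, 49000, 50500, 49500,
                     51500, 48500, 50000, 52500, 47500, 250000], dtype=float)

# z-score: (value - mean) / population standard deviation
z_scores = (salaries - salaries.mean()) / salaries.std()

# Flag values more than 3 standard deviations away from the mean
outliers = np.abs(z_scores) > 3
print(salaries[outliers])
```

One caveat for small datasets: with the population standard deviation, no value among n points can have a z-score larger in magnitude than the square root of n - 1, so a threshold of 3 cannot flag anything when there are 10 or fewer rows.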