## **Data Wrangling**

Perform the following operations using Python on any open-source dataset (e.g., data.csv):
1. Import all the required Python libraries.
2. Locate an open-source dataset on the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data preprocessing: check for missing values using pandas `isnull()`, and use `describe()` to get some initial statistics. Provide variable descriptions and variable types. Check the dimensions of the DataFrame.
5. Data formatting and data normalization: summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.


In [2]:
import pandas as pd
job_data = pd.read_csv("./datasets/DS_jobs.csv")
print(job_data.head())

   index          Job Title               Salary Estimate  \
0      0  Sr Data Scientist  $137K-$171K (Glassdoor est.)   
1      1     Data Scientist  $137K-$171K (Glassdoor est.)   
2      2     Data Scientist  $137K-$171K (Glassdoor est.)   
3      3     Data Scientist  $137K-$171K (Glassdoor est.)   
4      4     Data Scientist  $137K-$171K (Glassdoor est.)   

                                     Job Description  Rating  \
0  Description\n\nThe Senior Data Scientist is re...     3.1   
1  Secure our Nation, Ignite your Future\n\nJoin ...     4.2   
2  Overview\n\n\nAnalysis Group is one of the lar...     3.8   
3  JOB DESCRIPTION:\n\nDo you have a passion for ...     3.5   
4  Data Scientist\nAffinity Solutions / Marketing...     2.9   

              Company Name       Location            Headquarters  \
0         Healthfirst\n3.1   New York, NY            New York, NY   
1             ManTech\n4.2  Chantilly, VA             Herndon, VA   
2      Analysis Group\n3.8     Boston, MA
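
The `Company Name` values in the preview above carry the rating after a newline (for example `Healthfirst\n3.1`). A minimal sketch of stripping that suffix, using a small hypothetical frame `toy` in place of `job_data`; the new column name `company` is an illustrative choice, not part of the dataset:

```python
import pandas as pd

# Toy stand-in for job_data; the three values are copied from the preview.
toy = pd.DataFrame({
    "Company Name": ["Healthfirst\n3.1", "ManTech\n4.2", "Analysis Group\n3.8"],
})

# Keep only the text before the first newline and trim stray whitespace.
toy["company"] = toy["Company Name"].str.split("\n").str[0].str.strip()

print(toy["company"].tolist())
```

The rating is already available in its own `Rating` column, so dropping it from the name should lose no information.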

In [3]:
print(job_data.describe())

            index      Rating      Founded
count  672.000000  672.000000   672.000000
mean   335.500000    3.518601  1635.529762
std    194.133974    1.410329   756.746640
min      0.000000   -1.000000    -1.000000
25%    167.750000    3.300000  1917.750000
50%    335.500000    3.800000  1995.000000
75%    503.250000    4.300000  2009.000000
max    671.000000    5.000000  2019.000000
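
The minimum of -1 for both `Rating` and `Founded` in the summary above suggests that -1 is used as a placeholder for unknown values. That reading is an assumption; the dataset description would need to confirm it. Under that assumption, a sketch of converting the placeholder to `NaN` so that `isnull()` and `describe()` treat it as missing, shown on a hypothetical frame `toy`:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two numeric columns; the values are made up.
toy = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Founded": [1993, -1, 1968],
})

# Assumption: -1 means "unknown". Replace it with NaN in both columns.
cleaned = toy.replace(-1, np.nan)

# Each column now reports one missing value.
print(cleaned.isnull().sum())
```

After this step the mean and quartiles are computed over the known values only, so they are no longer pulled down by the placeholder.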


In [None]:

# Columns that contain at least one NaN
cols_with_nan = job_data.columns[job_data.isnull().any()].tolist()

print(f"Number of columns with NaN: {len(cols_with_nan)}" if cols_with_nan else "No missing values")
# Single quotes inside the f-string keep this line valid on Python versions before 3.12
print(f"\nColumns{' ':<12}Datatypes\n" + '*' * 28)
print(job_data.dtypes)
print('*' * 28 + '\n')
print(f"Dimension of Job dataset is:- {job_data.shape}")

No missing values

Columns            Datatypes
****************************
index                  int64
Job Title                str
Salary Estimate          str
Job Description          str
Rating               float64
Company Name             str
Location                 str
Headquarters             str
Size                     str
Founded                int64
Type of ownership        str
Industry                 str
Sector                   str
Revenue                  str
Competitors              str
dtype: object
****************************

Dimension of Job dataset is:- (672, 15)
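
Steps 5 and 6 of the task list can be sketched on one text column. The example below converts a column of repeated labels to the `category` dtype and then derives two numeric encodings from it. The frame `toy` and its `Sector` values are hypothetical, and which columns of the real dataset deserve this treatment is a judgment call:

```python
import pandas as pd

# Toy stand-in for a text column with a few repeated labels.
toy = pd.DataFrame({
    "Sector": ["Finance", "Information Technology", "Finance", "Health Care"],
})

# Step 5: store the repeated labels as a categorical dtype.
toy["Sector"] = toy["Sector"].astype("category")

# Step 6a: integer codes, numbered by the sorted order of the categories.
toy["sector_code"] = toy["Sector"].cat.codes

# Step 6b: one-hot indicator columns, one per category.
one_hot = pd.get_dummies(toy["Sector"], prefix="sector")

print(toy["sector_code"].tolist())
print(list(one_hot.columns))
```

Integer codes imply an ordering that the labels may not have, so one-hot columns are usually the safer choice for unordered categories.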


In [11]:
# Extract the minimum and maximum of Salary Estimate (in thousands of dollars) as numbers
# regex=False removes the literal characters; as a regex, '$' would mean end of string
salary_text = job_data['Salary Estimate'].str.replace('K', '', regex=False).str.replace('$', '', regex=False)
job_data[['min_salary', 'max_salary']] = salary_text.str.extract(r'(\d+)-(\d+)')
job_data['min_salary'] = pd.to_numeric(job_data['min_salary'])
job_data['max_salary'] = pd.to_numeric(job_data['max_salary'])

# Split Location into city and state; the city may contain spaces (e.g. "New York")
job_data[['city', 'state']] = job_data['Location'].str.extract(r'^(.+),\s*(\w+)$')
print(job_data['city'].isnull())
print(job_data['state'].isnull())


0      False
1      False
2      False
3      False
4      False
       ...  
667    False
668    False
669    False
670    False
671    False
Name: city, Length: 672, dtype: bool
0      False
1      False
2      False
3      False
4      False
       ...  
667    False
668    False
669    False
670    False
671    False
Name: state, Length: 672, dtype: bool
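
The printed Series shows only the first and last five rows, so a failed match in the middle would go unnoticed. A sketch of summarising the result with `isnull().sum()` instead, on a hypothetical `locations` Series. The pattern assumes the state is the last comma-separated token, and the `"Remote"` entry is invented to show what a non-matching value produces:

```python
import pandas as pd

# Hypothetical location strings; the last one has no comma.
locations = pd.Series(["New York, NY", "Chantilly, VA", "Remote"])

# Everything before the last comma is the city; the token after it is the state.
parts = locations.str.extract(r'^(.+),\s*(\w+)$')
parts.columns = ["city", "state"]

# One count per column instead of a full boolean Series.
print(parts.isnull().sum())
print(parts["city"].tolist())
```

Rows that do not match the pattern come back as `NaN` in both columns, so a non-zero count points to locations that need separate handling.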
