### Introduction

In this Jupyter Notebook, we clean and prepare a dataset of data science job postings from Glassdoor. It covers industries, job titles, estimated salaries, types of ownership, locations, and more. Our goal is to perform an insightful analysis on the cleaned dataset.

In [472]:
import numpy as np
import pandas as pd
import os
import re
from functions import *

In [473]:
path='/Users/leilajavanmardi/Desktop/Leila/Coding_IronHack/Data_Analytics_Bootcamp/week3/project1/data/raw/Uncleaned_DS_jobs.csv'
df = pd.read_csv(path)

### Initial Analysis

We will begin our analysis by examining the current state of the uncleaned dataset. This initial exploration will provide us with valuable insights into the structure, content, and quality of the data. Through this process, we'll identify any issues or inconsistencies that need to be addressed during the cleaning phase.

In [474]:
shape=df.shape
print(f'The original data set consists of  {shape[0]} rows and {shape[1]} columns')
print("Let's take a closer look at the dataset")
print('\nThe following are the first 3 rows of the dataset:\n')
display(df.head(3))
print("\nLet's now examine the last rows of the dataset.\n")
df.tail(3)

The original data set consists of  672 rows and 15 columns
Let's take a closer look at the dataset

The following are the first 3 rows of the dataset:



Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
0,0,Sr Data Scientist,$137K-$171K (Glassdoor est.),Description\n\nThe Senior Data Scientist is re...,3.1,Healthfirst\n3.1,"New York, NY","New York, NY",1001 to 5000 employees,1993,Nonprofit Organization,Insurance Carriers,Insurance,Unknown / Non-Applicable,"EmblemHealth, UnitedHealth Group, Aetna"
1,1,Data Scientist,$137K-$171K (Glassdoor est.),"Secure our Nation, Ignite your Future\n\nJoin ...",4.2,ManTech\n4.2,"Chantilly, VA","Herndon, VA",5001 to 10000 employees,1968,Company - Public,Research & Development,Business Services,$1 to $2 billion (USD),-1
2,2,Data Scientist,$137K-$171K (Glassdoor est.),Overview\n\n\nAnalysis Group is one of the lar...,3.8,Analysis Group\n3.8,"Boston, MA","Boston, MA",1001 to 5000 employees,1981,Private Practice / Firm,Consulting,Business Services,$100 to $500 million (USD),-1



Let's now examine the last rows of the dataset.



Unnamed: 0,index,Job Title,Salary Estimate,Job Description,Rating,Company Name,Location,Headquarters,Size,Founded,Type of ownership,Industry,Sector,Revenue,Competitors
669,669,Data Scientist,$105K-$167K (Glassdoor est.),Join a thriving company that is changing the w...,-1.0,AccessHope,"Irwindale, CA",-1,-1,-1,-1,-1,-1,-1,-1
670,670,Data Scientist,$105K-$167K (Glassdoor est.),100 Remote Opportunity As an AINLP Data Scient...,5.0,ChaTeck Incorporated\n5.0,"San Francisco, CA","Santa Clara, CA",1 to 50 employees,-1,Company - Private,Advertising & Marketing,Business Services,$1 to $5 million (USD),-1
671,671,Data Scientist,$105K-$167K (Glassdoor est.),Description\n\nThe Data Scientist will be part...,2.7,1-800-Flowers\n2.7,"New York, NY","Carle Place, NY",1001 to 5000 employees,1976,Company - Public,Wholesale,Business Services,$1 to $2 billion (USD),-1


In [475]:
col_df=df.columns
print(col_df)

Index(['index', 'Job Title', 'Salary Estimate', 'Job Description', 'Rating',
       'Company Name', 'Location', 'Headquarters', 'Size', 'Founded',
       'Type of ownership', 'Industry', 'Sector', 'Revenue', 'Competitors'],
      dtype='object')


In [476]:
# Use a name that does not shadow the built-in `type`
col_types=df.dtypes
for col in col_df:
    print(f' the type of the column {col} is: {col_types.loc[col]}')

 the type of the column index is: int64
 the type of the column Job Title is: object
 the type of the column Salary Estimate is: object
 the type of the column Job Description is: object
 the type of the column Rating is: float64
 the type of the column Company Name is: object
 the type of the column Location is: object
 the type of the column Headquarters is: object
 the type of the column Size is: object
 the type of the column Founded is: int64
 the type of the column Type of ownership is: object
 the type of the column Industry is: object
 the type of the column Sector is: object
 the type of the column Revenue is: object
 the type of the column Competitors is: object


In [477]:
df.nunique()

index                672
Job Title            172
Salary Estimate       30
Job Description      489
Rating                32
Company Name         432
Location             207
Headquarters         229
Size                   9
Founded              103
Type of ownership     13
Industry              58
Sector                23
Revenue               14
Competitors          108
dtype: int64

In [478]:
null_values_origin=df.isna().sum()
for col in col_df:
    print(f' the missing values of the column {col} : {null_values_origin.loc[col]}')

 the missing values of the column index : 0
 the missing values of the column Job Title : 0
 the missing values of the column Salary Estimate : 0
 the missing values of the column Job Description : 0
 the missing values of the column Rating : 0
 the missing values of the column Company Name : 0
 the missing values of the column Location : 0
 the missing values of the column Headquarters : 0
 the missing values of the column Size : 0
 the missing values of the column Founded : 0
 the missing values of the column Type of ownership : 0
 the missing values of the column Industry : 0
 the missing values of the column Sector : 0
 the missing values of the column Revenue : 0
 the missing values of the column Competitors : 0


#### Relevant Data Quality Issues

As observed, the uncleaned dataset encompasses several columns containing values in string, integer, and float formats. The most relevant data quality issues and key characteristics are:

- __Mixed Case:__ The values exhibit mixed cases, including uppercase, lowercase, and title case formats.
- __Inconsistent Naming Conventions:__ Inconsistencies are observed in the entered values, such as the usage of abbreviations or acronyms.
- __Variability in Terminology:__ Different values may describe similar content using varying terminology, leading to potential inconsistencies.
- __Special Characters and Numbers:__ Some values include special characters, numbers, or additional information alongside the main content.
- __Potential Errors and Non-Relevant Information:__ Some entries include irrelevant details or information that does not contribute to the dataset's intended analysis.
- __Placeholder Values:__ At first glance, the dataset appears to have no missing values. On closer inspection, however, missing values are represented by the placeholder '-1'. Addressing this is essential to represent missing data accurately and ensure data integrity, especially for numerical columns such as Rating and Founded.
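
As an illustration of the placeholder issue, the sketch below uses a small toy frame (the values are hypothetical, not taken from the dataset) to show that `isna()` finds nothing while `isin()` reveals the hidden '-1' entries, whether they are stored as numbers or as strings:

```python
import pandas as pd

# Toy frame with hypothetical values; -1 appears both as a number and as a string.
toy = pd.DataFrame({
    "Rating": [3.1, -1.0, 4.2],
    "Founded": [1993, -1, 1968],
    "Industry": ["Consulting", "-1", "-1"],
})

# To pandas, -1 is an ordinary value, so no cell counts as missing.
nan_total = int(toy.isna().sum().sum())

# Count the cells that hold the placeholder in either spelling.
placeholder_counts = toy.isin([-1, "-1"]).sum()

print(nan_total)
print(placeholder_counts)
```

Running the same `isin` check on the full DataFrame would give a per-column count of placeholders before any replacement is made.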

While a comprehensive analysis of the entire dataset has been conducted, only a few samples are displayed below to illustrate the points mentioned above:

In [479]:
df.Industry.value_counts()

Industry
-1                                          71
Biotech & Pharmaceuticals                   66
IT Services                                 61
Computer Hardware & Software                57
Aerospace & Defense                         46
Enterprise Software & Network Solutions     43
Consulting                                  38
Staffing & Outsourcing                      36
Insurance Carriers                          28
Internet                                    27
Advertising & Marketing                     23
Health Care Services & Hospitals            21
Research & Development                      17
Federal Agencies                            16
Investment Banking & Asset Management       13
Banks & Credit Unions                        8
Lending                                      8
Energy                                       5
Consumer Products Manufacturing              5
Telecommunications Services                  5
Insurance Agencies & Brokerages              4
Food

In [480]:
df['Job Title'].unique()

array(['Sr Data Scientist', 'Data Scientist',
       'Data Scientist / Machine Learning Expert',
       'Staff Data Scientist - Analytics',
       'Data Scientist - Statistics, Early Career', 'Data Modeler',
       'Experienced Data Scientist', 'Data Scientist - Contract',
       'Data Analyst II', 'Medical Lab Scientist',
       'Data Scientist/Machine Learning', 'Human Factors Scientist',
       'Business Intelligence Analyst I- Data Insights',
       'Data Scientist - Risk', 'Data Scientist-Human Resources',
       'Senior Research Statistician- Data Scientist', 'Data Engineer',
       'Associate Data Scientist', 'Business Intelligence Analyst',
       'Senior Analyst/Data Scientist', 'Data Analyst',
       'Machine Learning Engineer', 'Data Analyst I',
       'Scientist - Molecular Biology',
       'Computational Scientist, Machine Learning',
       'Senior Data Scientist', 'Jr. Data Engineer',
       'E-Commerce Data Analyst', 'Data Analytics Engineer',
       'Product Data Scient

In [481]:
df.Industry.unique()

array(['Insurance Carriers', 'Research & Development', 'Consulting',
       'Electrical & Electronic Manufacturing', 'Advertising & Marketing',
       'Computer Hardware & Software', 'Biotech & Pharmaceuticals',
       'Consumer Electronics & Appliances Stores',
       'Enterprise Software & Network Solutions', 'IT Services', 'Energy',
       'Chemical Manufacturing', 'Federal Agencies', 'Internet',
       'Health Care Services & Hospitals',
       'Investment Banking & Asset Management', 'Aerospace & Defense',
       'Utilities', '-1', 'Express Delivery Services',
       'Staffing & Outsourcing', 'Insurance Agencies & Brokerages',
       'Consumer Products Manufacturing', 'Industrial Manufacturing',
       'Food & Beverage Manufacturing', 'Banks & Credit Unions',
       'Video Games', 'Shipping', 'Telecommunications Services',
       'Lending', 'Cable, Internet & Telephone Providers', 'Real Estate',
       'Venture Capital & Private Equity', 'Miscellaneous Manufacturing',
       'Oil 

In [482]:
df['Salary Estimate'].unique

<bound method Series.unique of 0      $137K-$171K (Glassdoor est.)
1      $137K-$171K (Glassdoor est.)
2      $137K-$171K (Glassdoor est.)
3      $137K-$171K (Glassdoor est.)
4      $137K-$171K (Glassdoor est.)
                   ...             
667    $105K-$167K (Glassdoor est.)
668    $105K-$167K (Glassdoor est.)
669    $105K-$167K (Glassdoor est.)
670    $105K-$167K (Glassdoor est.)
671    $105K-$167K (Glassdoor est.)
Name: Salary Estimate, Length: 672, dtype: object>
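
The output above shows the whole Series rather than the distinct salary strings, because `unique` was referenced without parentheses. A minimal sketch on a toy Series (values copied from the rows displayed earlier) illustrates the difference:

```python
import pandas as pd

salaries = pd.Series([
    "$137K-$171K (Glassdoor est.)",
    "$137K-$171K (Glassdoor est.)",
    "$105K-$167K (Glassdoor est.)",
])

# Without parentheses this is only a reference to the method, not its result.
method_ref = salaries.unique

# With parentheses the method runs and returns the distinct values.
distinct = salaries.unique()

print(callable(method_ref))
print(distinct)
```

On the real column, `df['Salary Estimate'].unique()` would list the 30 distinct salary strings reported by `df.nunique()`.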

### Data Cleaning Process
Following the identification of relevant data quality issues and key characteristics, the next step involves cleaning the data to address these issues and ensure its suitability for analysis. In this section, I will outline the steps taken to clean the dataset using methods such as regular expressions (regex) and tailored functions for specific columns. These strategies were employed to remove inconsistencies, standardize formats, and address missing or irrelevant information efficiently. The cleaning process aims to enhance the quality and integrity of the dataset, laying the groundwork for accurate and reliable analysis.


In [483]:
df_clean=general_cleaning(df)
print(f'After standardizing column names and removing duplicates, the data set has {df_clean.shape[0]} rows and {df_clean.shape[1]} columns')

After standardizing column names and removing duplicates, the data set has 672 rows and 15 columns


#### Placeholder: Replacing -1 with numpy.nan
Because the dataset covers data science job postings on Glassdoor and -1 values stand for missing data, the first step is to replace -1 with numpy.nan. This is a suitable first step given the context and objectives of the analysis, and it shows at a glance how many values are missing.
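
A minimal sketch of this replacement on a toy frame (hypothetical values), showing how the missing-value counts change once both spellings of the placeholder are mapped to `numpy.nan`:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) mixing numeric -1 and string '-1' placeholders.
toy = pd.DataFrame({
    "rating": [3.1, -1.0, 4.2],
    "founded": [1993, -1, -1],
    "industry": ["Consulting", "-1", "Energy"],
})

# Before the replacement, pandas reports no missing values.
missing_before = toy.isna().sum()

# Map the numeric and the string form of the placeholder to NaN.
toy_clean = toy.replace({-1: np.nan, "-1": np.nan})
missing_after = toy_clean.isna().sum()

print(missing_after)
```

Note that an integer column such as `founded` becomes a float column once it contains NaN.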


In [484]:
print(' To provide the missing indicator, the -1 values are replaced with numpy.nan')
df_clean.replace({-1:np.nan,'-1': np.nan}, inplace=True)

 To provide the missing indicator, the -1 values are replaced with numpy.nan


In [485]:
columns=df_clean.columns
null_values=df_clean.isna().sum()
for col in columns:
    print(f' the missing values of the column {col} : {null_values.loc[col]}')

 the missing values of the column index : 0
 the missing values of the column job_title : 0
 the missing values of the column salary_estimate : 0
 the missing values of the column job_description : 0
 the missing values of the column rating : 50
 the missing values of the column company_name : 0
 the missing values of the column location : 0
 the missing values of the column headquarters : 31
 the missing values of the column size : 27
 the missing values of the column founded : 118
 the missing values of the column type_of_ownership : 27
 the missing values of the column industry : 71
 the missing values of the column sector : 71
 the missing values of the column revenue : 27
 the missing values of the column competitors : 501


After replacing the placeholder '-1' with numpy.nan, missing values appear across multiple columns in the dataset. The <b>"competitors"</b> column has the most missing values (501), followed by <b>"founded"</b> (118).
This observation underscores the importance of preprocessing for accurate subsequent analyses. Handling missing values properly during cleaning makes the analysis more robust and its interpretations more accurate. In the next steps, we decide for each column separately how its missing values should be filled.
In contrast, the <b>job_title, salary_estimate, company_name</b> and <b>location</b> columns have <b>no missing values</b>.

#### Tailored functions for cleaning the columns
The data cleaning process for the job title, industry, salary, location, headquarters, and company name columns involved several steps to standardize and normalize the values for consistency and ease of analysis. First, the unique values and their frequency counts were examined. Next, the uncleaned dataset was analyzed to identify common patterns and variations within each column. Custom cleaning functions were then developed for each column to address these variations and reduce the number of unique values. Throughout this process, we ensured uniformity and removed any irrelevant characters.<br>
In the next step, a dictionary of patterns was defined, where keys are regular expression patterns matching the original content of each column and values are the corresponding standardized content. For non-numeric columns, any unmatched values (including missing and irrelevant data) were categorized as 'others'. For numeric columns, missing values were replaced individually to ensure data integrity and accuracy.<br>
This cleaning strategy standardizes the data by identifying common patterns and replacing them with predefined standardized values. It reduces the number of unique values and keeps the representation of values consistent within each column, which in turn makes the data easier to analyze and interpret accurately.
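
The pattern-dictionary idea can be sketched as follows. This is only an illustration under stated assumptions: `JOB_PATTERNS` and `standardize` are hypothetical names, and the patterns are examples rather than the ones defined in the `functions` module used by this notebook.

```python
import re
import pandas as pd

# Hypothetical pattern dictionary: regex -> standardized label.
# Patterns are checked in insertion order and the first match wins.
JOB_PATTERNS = {
    r"machine learning engineer": "machine_learning_engineer",
    r"data scientist": "data_scientist",
    r"data engineer": "data_engineer",
    r"data analyst": "data_analyst",
}

def standardize(value, patterns, default="others"):
    """Return the label of the first matching pattern, or `default`."""
    if pd.isna(value):
        return default
    text = str(value).lower().strip()
    for pattern, label in patterns.items():
        if re.search(pattern, text):
            return label
    return default

titles = pd.Series([
    "Data Scientist - Contract",
    "Jr. Data Engineer",
    "Data Analyst II",
    "Medical Lab Scientist",
    None,
])
cleaned = titles.apply(standardize, patterns=JOB_PATTERNS)
print(cleaned.tolist())
```

Keeping the patterns in a dictionary separates the mapping rules from the matching logic, so a new category can be added without touching the function.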

In [486]:
df_clean['job_cleaned']=df_clean.job_title.apply(cleaning_job_title)
print(f"\nIn the original dataset, the job titles had \033[1m{df['Job Title'].nunique()}\033[0m unique values")
print(f'After standardizing and recategorizing the job titles and handling the irrelevant information,\nwe were able to reduce the number of unique values.\n\nThe job titles are now categorized in the following \033[1m{df_clean.job_cleaned.nunique()}\033[0m groups:')
print(df_clean.job_cleaned.unique())


In the original dataset, the job titles had 172 unique values
After standardizing and recategorizing the job titles and handling the irrelevant information,
we were able to reduce the number of unique values.

The job titles are now categorized in the following 8 groups:
['data_scientist' 'others' 'data_analyst' 'data_engineer'
 'machine_learning_engineer' 'senior_data_scientist'
 'data_science_analytics_leadership' 'computational_scientist']


In [487]:
df_clean['indu_cl'] = df_clean['industry'].apply(cleaning_industry)
print(f"\nIn the original dataset, the industry column had \033[1m{df['Industry'].nunique()}\033[0m unique values")
print(f'\nAfter standardizing and recategorizing the industries and handling the irrelevant information,\nwe were able to reduce the number of unique values.\nThe industries are now categorized in the following \033[1m{df_clean.indu_cl.nunique()}\033[0m groups:')
print(f'{df_clean.indu_cl.unique()}\n')


In the original dataset, the industry column had 58 unique values

After standardizing and recategorizing the industries and handling the irrelevant information,
we were able to reduce the number of unique values.
The industries are now categorized in the following 16 groups:
['insurance_agencies' 'others' 'consulting' 'manufacturing'
 'advertising_marketing' 'computer_hardware_software'
 'enterprise_software_network_solutions ' 'energy_and_utilities'
 'government_public_sector' 'internet_telephone_providers'
 'healthcare_and_pharmaceuticals' 'finance_and_banking'
 'staffing_outsourcing' 'transportation_logistics'
 'telecommunications_services' 'real_estate_construction']




##### Cleaning the locations and the headquarters
Initially, my approach involved removing the abbreviations present in each value by splitting the values and selecting index 0 (the code for this is marked in the function as a comment). However, upon closer inspection, I realized that these abbreviations correspond to different geographical locations. For instance, "Columbia, MD" refers to Columbia, Maryland, "Columbia, MO" to Columbia, Missouri, and "Columbia, SC" to Columbia, South Carolina.

Therefore, I made the decision not to remove the abbreviations. An alternative strategy could have been to remove the abbreviations and rename locations with the same name but different geographical identifiers. However, this approach did not directly impact the analysis, and it would not have reduced the number of unique values. Moreover, accurately assigning the abbreviations without explicit information would have been challenging, requiring additional research to understand which abbreviation corresponds to which geographical location. As a result, the abbreviations remained in the final dataset.
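
A short sketch of why the state abbreviations matter, using the three Columbia examples mentioned above. The normalization shown (lowercasing and trimming) is an assumption for illustration, not necessarily what `cleaning_locations` does:

```python
import pandas as pd

locations = pd.Series(["Columbia, MD", "Columbia, MO", "Columbia, SC"])

# Keeping only the city name collapses three different places into one label.
city_only = locations.str.split(",").str[0].str.strip()

# Lowercasing and trimming keeps the state code, so the places stay distinct.
normalized = locations.str.lower().str.strip()

print(city_only.nunique(), normalized.nunique())
```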

In [488]:
df_clean['location_cleaned']=df_clean.location.apply(cleaning_locations)
print(f'\nThe location column had \033[1m{df.Location.isna().sum()}\033[0m missing values in the original dataset.')
print(f'After the cleaning process, the location column has \033[1m{df_clean.location_cleaned.nunique()}\033[0m unique values.\n')


The location column had 0 missing values in the original dataset.
After the cleaning process, the location column has 207 unique values.



In [489]:
df_clean['headquarters_cleaned']=df_clean.headquarters.apply(cleaning_headquarters)

##### Cleaning the salary estimates
The cleaning process for the salary column involves several steps aimed at standardizing and normalizing the salary values. First, all values are cast to strings to ensure uniformity. Next, the '$' and 'K' symbols are removed to clean the formatting. Then any parentheses and spaces are removed from the string using regex substitution.
<br>The process then identifies and replaces repeated patterns within the salary values, using a predefined dictionary with regular expressions as keys and their replacement values as values. For instance, patterns such as '56-97', '66-112', '69-116', '71-123', or '79-106' are replaced with '56-125', representing a broader salary range.
The function is designed to assign missing data the label 'not_verified'; however, there were no missing values in this column. Overall, the cleaning process ensures a consistent representation of salary values and reduces the number of unique values, making the data easier to analyze and interpret accurately.
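
The steps described above can be sketched as follows. This is a hedged illustration: `SALARY_BUCKETS` and `clean_salary` are hypothetical names, only the '56-125' group is taken from the text, and the actual `cleaning_salary` function in `functions` may differ (for example, it maps every range to a bucket, whereas this sketch returns unmatched ranges unchanged).

```python
import re
import pandas as pd

# Hypothetical bucket map: regex over the stripped range -> broader range.
# Only this one group is documented in the text; others would be added the same way.
SALARY_BUCKETS = {
    r"56-97|66-112|69-116|71-123|79-106": "56-125",
}

def clean_salary(value, buckets=SALARY_BUCKETS):
    """Strip formatting from a salary string and map it to a broader range."""
    if pd.isna(value):
        return "not_verified"
    # Remove the currency and thousands markers.
    text = str(value).replace("$", "").replace("K", "")
    # Drop parenthesized notes such as "(Glassdoor est.)" and any whitespace.
    text = re.sub(r"\(.*?\)|\s", "", text)
    for pattern, bucket in buckets.items():
        if re.fullmatch(pattern, text):
            return bucket
    # Ranges without a bucket in this sketch are returned as-is.
    return text

print(clean_salary("$79K-$106K (Glassdoor est.)"))
print(clean_salary("$137K-$171K (Glassdoor est.)"))
print(clean_salary(None))
```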

In [490]:
df_clean['sal_cleaned']=df_clean.salary_estimate.map(cleaning_salary)
print(f"\nIn the original dataset, the salaries had \033[1m{df['Salary Estimate'].nunique()}\033[0m unique values")
print(f'After standardizing, the salaries are now categorized in the following \033[1m{df_clean.sal_cleaned.nunique()}\033[0m groups:')
print(df_clean.sal_cleaned.unique())


In the original dataset, the salaries had 30 unique values
After standardizing, the salaries are now categorized in the following 7 groups:
['120-200' '75-145' '90-170' '56-125' '140-225' '30-56' '210-335']


In [491]:
# Remove the literal '(usd)' suffix (regex=False), then strip so no trailing space remains
df_clean['revenue_cleaned']=df_clean.revenue.str.lower().str.replace('(usd)','',regex=False).str.strip()
df_clean.revenue_cleaned = df_clean.revenue_cleaned.str.replace('unknown / non-applicable','unknown_non_applicable',regex=False)
df_clean.revenue_cleaned = df_clean.revenue_cleaned.fillna('unknown_non_applicable')
print(f'\nAfter standardizing, the revenues are now categorized in the \033[1m{df_clean.revenue_cleaned.nunique()}\033[0m groups below,\nwhile the missing or non-applicable values were assigned to unknown_non_applicable:')
print(df_clean.revenue_cleaned.unique())


After standardizing, the revenues are now categorized in the 13 groups below,
while the missing or non-applicable values were assigned to unknown_non_applicable:
['unknown_non_applicable' '$1 to $2 billion' '$100 to $500 million'
 '$10+ billion' '$2 to $5 billion' '$500 million to $1 billion'
 '$5 to $10 billion' '$10 to $25 million' '$25 to $50 million'
 '$50 to $100 million' '$1 to $5 million' '$5 to $10 million'
 'less than $1 million']


In [492]:
df_clean['size_cleaned']= df_clean['size'].fillna('unknown').replace('Unknown','unknown')
print(df_clean['size_cleaned'].isna().sum())

0


In [493]:
df_clean['company_name_cleaned']=df_clean.company_name.apply(cleaning_companies)

In [494]:
df_clean['sector_cleaned']=df_clean.sector.str.lower().str.strip().str.replace(' & ','_&_')
df_clean.sector_cleaned = df_clean.sector_cleaned.fillna('unknown')
print(f'\nAfter standardizing, the sectors are now categorized in the following \033[1m{df_clean.sector_cleaned.nunique()}\033[0m groups.\n')


After standardizing, the sectors are now categorized in the following 23 groups.



In [495]:
# Ratings are already numeric after the -1 replacement; keep them as floats (NaN marks missing values)
df_clean['rating_cleaned'] = df_clean.rating.astype(float)

In [496]:
df_clean['type_of_ownership_cleaned']=df_clean.type_of_ownership.str.lower().str.strip().str.replace(' - ','_').str.replace(' / ','_').str.replace(' ','_')
df_clean['type_of_ownership_cleaned']=df_clean['type_of_ownership_cleaned'].fillna('unknown')
print(df_clean.type_of_ownership_cleaned.unique())

['nonprofit_organization' 'company_public' 'private_practice_firm'
 'company_private' 'government' 'subsidiary_or_business_segment'
 'other_organization' 'unknown' 'hospital' 'self-employed'
 'college_university' 'contract']


In [497]:
print(df_clean.founded.isna().sum())
print(df_clean.founded.unique())
df_clean['founded_clean']= df_clean.founded.fillna(0).astype(int)
print(df_clean.founded_clean.unique())

118
[1993. 1968. 1981. 2000. 1998. 2010. 1996. 1990. 1983. 2014. 2012. 2016.
 1965. 1973. 1986. 1997. 2015. 1945. 1988. 2017. 2011. 1967. 1860. 1992.
 2003. 1951. 2005. 2019. 1925. 2008. 1999. 1978. 1966. 1912. 1958. 2013.
 1849. 1781. 1926. 2006. 1994. 1863. 1995.   nan 1982. 1974. 2001. 1985.
 1913. 1971. 1911. 2009. 1959. 2007. 1939. 2002. 1961. 1963. 1969. 1946.
 1957. 1953. 1948. 1850. 1851. 2004. 1976. 1918. 1954. 1947. 1955. 2018.
 1937. 1917. 1935. 1929. 1820. 1952. 1932. 1894. 1960. 1788. 1830. 1984.
 1933. 1880. 1887. 1970. 1942. 1980. 1989. 1908. 1853. 1875. 1914. 1898.
 1956. 1977. 1987. 1896. 1972. 1949. 1962.]
[1993 1968 1981 2000 1998 2010 1996 1990 1983 2014 2012 2016 1965 1973
 1986 1997 2015 1945 1988 2017 2011 1967 1860 1992 2003 1951 2005 2019
 1925 2008 1999 1978 1966 1912 1958 2013 1849 1781 1926 2006 1994 1863
 1995    0 1982 1974 2001 1985 1913 1971 1911 2009 1959 2007 1939 2002
 1961 1963 1969 1946 1957 1953 1948 1850 1851 2004 1976 1918 1954 1947
 1955 2018 19

In [498]:
col_to_drop =['job_title', 'industry', 'salary_estimate','location','size', 'headquarters', 'company_name','sector', 'revenue', 'rating','type_of_ownership','founded']
df_clean.drop(columns=col_to_drop, inplace=True)
print(df_clean.columns)

Index(['index', 'job_description', 'competitors', 'job_cleaned', 'indu_cl',
       'location_cleaned', 'headquarters_cleaned', 'sal_cleaned',
       'revenue_cleaned', 'size_cleaned', 'company_name_cleaned',
       'sector_cleaned', 'rating_cleaned', 'type_of_ownership_cleaned',
       'founded_clean'],
      dtype='object')


#### Final Clean Dataset
In line with our research objectives and to address the identified issues, two additional columns (job_description and competitors) were removed, and all remaining columns were renamed.

In [499]:
df_final=df_clean.drop(columns=['job_description','competitors'])
df_final.columns

Index(['index', 'job_cleaned', 'indu_cl', 'location_cleaned',
       'headquarters_cleaned', 'sal_cleaned', 'revenue_cleaned',
       'size_cleaned', 'company_name_cleaned', 'sector_cleaned',
       'rating_cleaned', 'type_of_ownership_cleaned', 'founded_clean'],
      dtype='object')

In [500]:
new_column_names = {
    'index': 'index',
    'job_cleaned': 'job_title',
    'sal_cleaned': 'salary_range_$_K',
    'location_cleaned': 'location',
    'size_cleaned': 'size',
    'founded_clean':'founded',
    'headquarters_cleaned': 'headquarters',
    'company_name_cleaned': 'company_name',
    'sector_cleaned': 'sector',
    'revenue_cleaned': 'revenue',
    'rating_cleaned': 'rating',
    'type_of_ownership_cleaned': 'type_of_ownership',
    'indu_cl': 'industry',    
}

# Rename the columns and reorder them in the final dataset
df_final = df_final.rename(columns=new_column_names)
df_final = df_final[list(new_column_names.values())]
df_final.sample(3)

Unnamed: 0,index,job_title,salary_range_$_K,location,size,founded,headquarters,company_name,sector,revenue,rating,type_of_ownership,industry
270,270,data_engineer,90-170,"santa clara, ca",51 to 200 employees,2011,"mountain view, ca",shape security,information technology,unknown_non_applicable,4.1,company_private,others
645,645,machine_learning_engineer,90-170,"santa clara, ca",10000+ employees,1976,"cupertino, ca",apple,information technology,$10+ billion,4.1,company_public,computer_hardware_software
620,620,data_scientist,75-145,united states,5001 to 10000 employees,1951,"san diego, ca",cubic,aerospace_&_defense,$1 to $2 billion,3.3,company_public,others


In [501]:
df_final.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 672 entries, 0 to 671
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   index              672 non-null    int64  
 1   job_title          672 non-null    object 
 2   salary_range_$_K   672 non-null    object 
 3   location           672 non-null    object 
 4   size               672 non-null    object 
 5   founded            672 non-null    int64  
 6   headquarters       672 non-null    object 
 7   company_name       672 non-null    object 
 8   sector             672 non-null    object 
 9   revenue            672 non-null    object 
 10  rating             622 non-null    float64
 11  type_of_ownership  672 non-null    object 
 12  industry           672 non-null    object 
dtypes: float64(1), int64(2), object(10)
memory usage: 68.4+ KB


### Data Analysis Process

This analysis intends to provide a comprehensive snapshot of the current data science job market, using the job postings on Glassdoor as a case study.
The primary aim is to uncover key insights into the job market and explore aspects such as salary estimates, company sizes, and revenue. It aims to find patterns that answer the following questions:
- Are there certain industries or sectors with high demand for data science roles, and what are the most common job titles required?
- Is there a correlation between company size, revenue, and salary estimate?
- Are there geographic regions where open positions are more prevalent?
- Can any factors that contribute to higher ratings be identified?
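
As a starting point for these questions, the sketch below works on a few toy rows shaped like the cleaned dataset (the rows themselves are hypothetical). It counts postings per industry and derives a numeric salary midpoint, named `salary_mid_k` here purely for illustration, so that salaries can be compared across company sizes:

```python
import pandas as pd

# Toy rows (hypothetical) using column names from the final dataset.
jobs = pd.DataFrame({
    "industry": ["consulting", "consulting", "energy_and_utilities", "others"],
    "size": ["1 to 50 employees", "10000+ employees",
             "10000+ employees", "1 to 50 employees"],
    "salary_range_$_K": ["90-170", "120-200", "75-145", "56-125"],
})

# Demand per industry: number of postings in each group.
postings_per_industry = jobs["industry"].value_counts()

# Turn "low-high" into a numeric midpoint so it can be aggregated.
bounds = jobs["salary_range_$_K"].str.split("-", expand=True).astype(float)
jobs["salary_mid_k"] = bounds.mean(axis=1)

# Average salary midpoint per company size.
mid_by_size = jobs.groupby("size")["salary_mid_k"].mean()

print(postings_per_industry)
print(mid_by_size)
```

Because the salary ranges are broad buckets, any comparison based on midpoints is coarse and should be read as indicative only.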