<div style="text-align: center">
   
# **📊 Data Science Jobs Analysis 💼** 
    
</div>


# <u>**Problem Statement**</u>:
<img src="image.jpg" alt="Description" style="width: 70%; height: auto;" />

**The objective of this project is to analyze job listings data scraped from Naukri.com to gain insights into the data science job market 📈. The dataset includes various features such as job titles, companies, required experience, locations, salary ranges 💼, and key skills demanded by employers in the data science field. The primary goal is to understand the key trends 🔍 in job requirements, company ratings, and skill sets to help aspiring data scientists 🧑‍💻 make informed decisions about career growth, skill development, and job opportunities. This project will involve comprehensive data cleaning 🧹, exploratory data analysis 📊, and visualization 🖼️ to uncover valuable insights into the current job market landscape.**


<div style="text-align: center">
   
# **SPRINT 1 - Web Scraping** 
    
</div>

## **Description**


In Sprint 1, we focused on web scraping job listings from Naukri.com to gather valuable insights into the data science job market. We utilized **Selenium**, a powerful web automation tool, to create a script that interacts with web elements dynamically and extracts relevant job data.

The coding procedure involves the following key steps:

1. **Setup**: We initialized the Selenium WebDriver (e.g., ChromeDriver) to launch the web browser and navigate to the Naukri.com job listings pages.

2. **Looping Through Pages**: A loop was implemented to iterate through multiple job listing pages, enabling us to scrape data across different listings efficiently.

3. **Locating Elements**: Using XPath selectors relative to each job card, we identified and extracted key data points:
   - **Job Title**: Extracted using the XPath `'.//div[@class=" row1"]'`.
   - **Company, Rating, and Reviews**: Parsed from the text of `'.//div[@class=" row2"]'`.
   - **Job Posting Date**: Retrieved using `'.//span[@class="job-post-day "]'`.
   - **Job Details**: Located through `'.//div[@class="job-details "]'`, which includes experience requirements, salary, and location.
   - **Skills**: Read from `'.//ul[@class="tags-gt "]'`.

4. **Storing Data**: Each extracted value was appended to a corresponding list (e.g., `title`, `company`, `days`) for further analysis.

This foundational data collection process will support our analysis in later sprints, enabling us to uncover valuable insights into the data science job market.
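
The text-parsing step that follows element location can be sketched on plain strings, without a browser. The helper names and the sample card text below are assumptions made for illustration; only the regular expressions are modelled on the patterns applied to the company and job-details blocks.

```python
import re

def parse_company_block(text):
    """Split the text of a company row into name, rating and review count."""
    rating = re.search(r"\d\.\d", text)
    reviews = re.search(r"\d+(?= Reviews)", text)
    # The company name is taken to be the first line of the block (assumption).
    first_line = text.split("\n")[0].strip()
    return {
        "company": first_line or None,
        "rating": float(rating.group(0)) if rating else None,
        "reviews": int(reviews.group(0)) if reviews else None,
    }

def parse_details_block(text):
    """Split the job-details text into experience, salary and location."""
    lines = [line.strip() for line in text.split("\n") if line.strip()]
    experience = re.search(r"\d+-\d+", lines[0]) if lines else None
    return {
        "years": experience.group(0) if experience else None,
        "salary": lines[1] if len(lines) > 1 else None,
        "location": lines[2] if len(lines) > 2 else None,
    }

# Hypothetical card text, shaped like the values seen in the scraped data.
company_info = parse_company_block("Accenture\n4.0\n49777 Reviews")
details_info = parse_details_block("10-15 Yrs\nNot disclosed\nPune")
print(company_info)
print(details_info)
```

Keeping the parsing in small functions like these makes it possible to check the patterns against a few saved card texts before running the full scrape.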

In [2]:
#importing all the required libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup
import re
import numpy as np
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException
import time

In [58]:
#Lists to store the extracted data
title=[]
company=[]
ratings=[]
reviews=[]
years=[]
location=[]
days=[]
salary=[]
skills=[]
como=[]
key_skills=[]

#initializing chrome browser
driver=webdriver.Chrome()

#iterating through the pages
for j in range(1,111):
    url = f"https://www.naukri.com/data-science-jobs-{j}"
    driver.get(url)
    time.sleep(5)
    for i in driver.find_elements(By.XPATH , './/div[@class="cust-job-tuple layout-wrapper lay-2 sjw__tuple "]'):
        
        #Extracting job title
        t = i.find_element(By.XPATH , './/div[@class=" row1"]')
        #.text is never None, so test for an empty string instead
        if t.text:
            title.append(t.text)
        else:
            title.append(np.nan)
            
        #Extracting the name of the company (the line that precedes the rating)
        c = i.find_element(By.XPATH , './/div[@class=" row2"]')
        if re.findall(r"(^\w+.*)\n\d\.",c.text):
            company.append(''.join(re.findall(r"(^\w+.*)\n\d\.",c.text)))
        else:
            company.append(np.nan)
            
        #Extracting the ratings of the company 
        if re.findall(r"\d\.\d",c.text):
            ratings.append(re.findall(r"\d\.\d",c.text)[0])
        else:
            ratings.append(np.nan)
            
        #Extracting the reviews
        if re.findall(r"\d+(?= Reviews)",c.text):
            reviews.append(re.findall(r"\d+(?= Reviews)",c.text)[0])
        else:
            reviews.append(np.nan)
            
        #Extracting years of experience (\d+ keeps multi-digit ranges such as 10-15 intact)
        y = i.find_element(By.XPATH , './/div[@class="job-details "]')
        if re.findall(r"\d+-\d+",y.text):
            years.append(re.findall(r"\d+-\d+",y.text)[0])
        else:
            years.append(np.nan)
            
        #Extracting the location of the job (third line of the job-details text)
        l = i.find_element(By.XPATH , './/div[@class="job-details "]')
        if re.findall(r'\n\w.*\n(.*)',l.text):
            location.append(''.join(re.findall(r'\n\w.*\n(.*)',l.text)))
        else:
            location.append(np.nan)
            
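        #Extracting the number of days ago, the job was posted
        d = i.find_element(By.XPATH , './/span[@class="job-post-day "]')
        #Read the posting age from d (the job-post-day span), not from the job-details text
        if re.findall(r"\d+",d.text):
            days.append(re.findall(r"\d+",d.text)[0])
        elif "Just Now" in d.text:
            #Posted today; no `continue` here, otherwise the remaining lists fall out of step
            days.append(0)
        else:
            days.append(np.nan)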
            
        #Extracting the salary if disclosed (second line of the job-details text)
        s = i.find_element(By.XPATH , './/div[@class="job-details "]')
        if re.findall(r"\n(.*)\n",s.text):
            salary.append(''.join(re.findall(r"\n(.*)\n",s.text)))
        else:
            salary.append(np.nan)
            
        #Extracting the skills and then extracting key skills from the skills
        try:
            sk = i.find_element(By.XPATH, './/ul[@class="tags-gt "]')
            skills.append(sk.text) 
            #Adjacent raw strings are concatenated, so no newlines or indentation leak into the pattern
            skill=re.findall(r"(?i)(data analytics|machine learning|python|matplotlib|seaborn|pandas|excel|sql|numpy|"
                  r"natural language processing|nlp|deep learning|dl|ml|visualization|java|C\+\+|"
                  r"image processing|sas|bet|statistical modelling|data science|data analysis|data mining|"
                  r"data analyst|statistical analysis|data engineer|statistics|big data|predictive modelling|"
                  r"data security|time series analysis|data collection|data engineering|gen ai|cloud services|"
                  r"aws|automation|azure|nosql|mysql|llm|tensorflow|pyspark|ai|artificial intelligence|"
                  r"sap|data processing|power bi|powerbi|business analysis|data management|neural networks)", sk.text)
            
            key_skills.append(','.join(skill))
        except NoSuchElementException:
            #No skill tags on this card, so both lists get a missing value
            skills.append(np.nan)
            key_skills.append(np.nan)

#closing the browser once all pages are scraped
driver.quit()

In [None]:
#Creating the data frame
d0=pd.DataFrame({"title":title,
               "company":company,
                "ratings":ratings,
                "reviews":reviews,
                "years":years,
                "location":location,
                "days":days,
                "salary":salary,
                "skills":skills,
                "key_skills":key_skills})

In [None]:
#converting to csv
d0.to_csv('Nakuri.csv', index=False)

In [82]:
#reading the csv file
df=pd.read_csv('Nakuri.csv')

In [83]:
#dropping unnecessary columns
df.drop('skills',axis=1,inplace=True)

##  <u>**Column Description**</u>

- **title**: The job title or position being advertised (e.g., Data Scientist, Data Analyst, etc.).
- **company**: The name of the company posting the job listing.
- **ratings**: The rating of the company (if available) as given by employees or users.
- **reviews**: The number of reviews left for the company.
- **years**: The required range of years of experience for the job (e.g., 2-5 years).
- **location**: The geographical location of the job (e.g., city or region).
- **days**: The number of days ago the job was posted.
- **salary**: The salary offered for the job (if provided in the listing).
- **key_skills**: Specific technical skills extracted from the job description (e.g., Python, SQL, Machine Learning, etc.).
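
As a sketch of how a `key_skills` value can be derived from the raw skill tags, the vocabulary can be kept as a list and compiled into one pattern. The `SKILLS` list below is a shortened, illustrative subset, and `extract_key_skills` is a hypothetical helper; unlike the scraper, it lowercases matches and removes duplicates.

```python
import re

# Shortened, illustrative subset of the skills vocabulary (assumption).
SKILLS = ["machine learning", "deep learning", "python", "sql", "c++",
          "power bi", "nlp", "ml", "ai"]

# Longest terms first so multi-word skills are tried before short ones;
# the lookarounds stop "ml" matching inside "html" or "ai" inside "email".
SKILL_PATTERN = re.compile(
    r"(?<!\w)("
    + "|".join(re.escape(s) for s in sorted(SKILLS, key=len, reverse=True))
    + r")(?!\w)",
    re.IGNORECASE,
)

def extract_key_skills(text):
    """Return the known skills found in text as a comma-separated string."""
    found = []
    for match in SKILL_PATTERN.finditer(text):
        skill = match.group(1).lower()
        if skill not in found:  # keep the first occurrence only
            found.append(skill)
    return ",".join(found)

sample_tags = "Python\nMachine Learning\nHTML\nC++\nEmail marketing\npython"
print(extract_key_skills(sample_tags))
```

Building the pattern from a list keeps the vocabulary easy to extend, and `re.escape` handles entries such as `c++` without manual escaping.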

In [84]:
df.head()

| | title | company | ratings | reviews | years | location | days | salary | key_skills |
|---|---|---|---|---|---|---|---|---|---|
| 0 | R&D- Data Science and Analytics Lead | Pepsi Foods | 4.1 | 2308.0 | 2-5 | Hyderabad | 2.0 | Not disclosed | Machine learning,big data |
| 1 | Specialist- Data Science & Analytics | Carrier | 3.8 | 419.0 | 2-5 | Bengaluru | 2.0 | Not disclosed | ai,data science,Neural networks,Artificial Int... |
| 2 | Data Science Analytics Analyst | Accenture | 4.0 | 49777.0 | 3-5 | Bengaluru | 3.0 | Not disclosed | data mining,machine learning,excel,python,data... |
| 3 | Data Science & Analytics Engagement Lead | CANPACK | 4.3 | 236.0 | 0-15 | Pune | 1.0 | Not disclosed | sql,data science,data mining,power bi,machine ... |
| 4 | Assoc Manager, Data Science Analytics | NaN | NaN | NaN | 3-6 | Bengaluru | 3.0 | Not disclosed | excel,python,data mining |


In [85]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2172 entries, 0 to 2171
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   title       2171 non-null   object 
 1   company     1711 non-null   object 
 2   ratings     1711 non-null   float64
 3   reviews     1711 non-null   float64
 4   years       2168 non-null   object 
 5   location    2171 non-null   object 
 6   days        2171 non-null   float64
 7   salary      2171 non-null   object 
 8   key_skills  2139 non-null   object 
dtypes: float64(3), object(6)
memory usage: 152.8+ KB


<div style="text-align: center">
   
# **SPRINT 2 - Data Wrangling** 
    
</div>

## **Description**

In Sprint 2, we focused on data wrangling to prepare the scraped job listings dataset for analysis. This phase is crucial as it transforms raw data into a clean, structured format, enabling us to derive meaningful insights.

The data wrangling procedure involves several key steps:

1. **Data Cleaning**: We began by addressing missing values, duplicates, and inconsistencies within the dataset. This included removing any job postings that lacked essential information such as job titles or company names.

2. **Data Type Conversion**: We ensured that each column was of the appropriate data type. For example, salary ranges were converted from strings to numerical values, and dates were formatted for easier manipulation.

3. **Feature Engineering**: New features were created to enhance the dataset. For instance, we extracted the minimum and maximum salary from the salary range column and categorized job titles into broader roles (e.g., Data Analyst, Data Scientist).

4. **Text Normalization**: To facilitate analysis, we standardized text fields by converting them to lowercase and removing unnecessary whitespace or special characters, ensuring consistency across the dataset.

5. **Data Transformation**: Finally, we transformed the dataset into a tidy format, organizing it for easy analysis. This included pivoting tables and encoding categorical variables as needed.

By the end of Sprint 2, the dataset was well-prepared, laying a solid foundation for the exploratory data analysis in the subsequent sprint. This comprehensive data wrangling process is essential for ensuring the reliability and accuracy of our findings.
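
The text normalization step can be illustrated with a small helper. This is a minimal sketch: the function name and the exact set of characters kept are assumptions, chosen so that skill names such as `c++` or `r&d` survive the cleaning.

```python
import re

def normalize_text(value):
    """Lowercase a text field, drop stray punctuation and collapse whitespace."""
    if value is None:
        return None
    # Keep word characters, whitespace and a few symbols common in job titles.
    cleaned = re.sub(r"[^\w\s&+#.\-/]", " ", value)
    # Collapse runs of whitespace into single spaces and trim both ends.
    return re.sub(r"\s+", " ", cleaned).strip().lower()

print(normalize_text("  Data   Scientist!! "))
print(normalize_text("R&D- Data Science and Analytics Lead"))
```

Applied to a pandas column, the same function could be passed to `Series.apply`, or replaced by the equivalent `.str.lower()` and `.str.strip()` calls.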

### **Changes made to the data**

- Fixing the salary column.
  1) Replacing `Not disclosed` with NaN.
  2) Converting salaries quoted in thousands to lakhs.
  3) Creating two separate columns for the starting and ending values of the salary range.
  4) Converting the data type of the minimum and maximum salary.
  5) Dropping the salary column.
- Fixing the years column.
  1) Creating two columns (min_year, max_year) from the `years` column.
  2) Converting the data type of the columns.
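
Both the salary and the experience columns hold `low-high` ranges, so the split into two numeric values can be sketched with a single helper. `split_range` and `RANGE_PATTERN` are hypothetical names, and the sketch assumes the range is written with a plain hyphen.

```python
import math
import re

# One number, a hyphen, another number; decimals are optional.
RANGE_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")

def split_range(value):
    """Return (low, high) as floats, or (nan, nan) if no range is present."""
    if not isinstance(value, str):
        return (math.nan, math.nan)
    match = RANGE_PATTERN.search(value)
    if match is None:
        return (math.nan, math.nan)
    return (float(match.group(1)), float(match.group(2)))

print(split_range("4.5-5.5 Lacs PA"))
print(split_range("10-15"))
print(split_range("Not disclosed"))
```

Returning floats directly avoids a separate type-conversion pass, and values without a range come back as NaN rather than as empty strings.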

####  <u>**Fixing salary column**</u>

Replacing `Not disclosed` values with null

In [86]:

df['salary'] = df['salary'].replace('Not disclosed', np.nan)


Removing all the text from the salary column so that it will be easy to extract the minimum and maximum salary from the range

In [89]:
df['salary'] = df['salary'].str.replace(' Lacs PA','').str.replace(' Cr and above PA','').str.replace('5 Cr PA','')

In [93]:
df['salary'] = df['salary'].str.replace(' PA','')

Converting thousands to lakhs:<br>
Only two salary values are quoted in thousands of rupees, so we fix them manually (50,000 = 0.5 lakh and 80,000 = 0.8 lakh); with more such values, a programmatic conversion would be preferable.

In [98]:
df['salary'] = df['salary'].str.replace('50,000','0.5').str.replace('80,000','0.8')
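
For a dataset with many rupee amounts, the manual replacement could be generalized. The sketch below is an assumption-laden alternative, not the approach used in this notebook: `rupees_to_lakhs` is a hypothetical helper that treats every comma-grouped number as an amount in rupees and rewrites it in lakhs (1 lakh = 100,000).

```python
import re

def rupees_to_lakhs(text):
    """Rewrite comma-grouped rupee amounts (e.g. '50,000') as lakhs ('0.5')."""
    def convert(match):
        rupees = int(match.group(0).replace(",", ""))
        # The :g format drops trailing zeros, so 0.50 is written as 0.5.
        return f"{rupees / 100_000:g}"
    # Matches both Western (150,000) and Indian (1,50,000) digit grouping.
    return re.sub(r"\d{1,3}(?:,\d{2,3})+", convert, text)

print(rupees_to_lakhs("50,000-80,000"))
print(rupees_to_lakhs("1,50,000-2,00,000"))
print(rupees_to_lakhs("10-20"))
```

Values already expressed in lakhs contain no commas and pass through unchanged. In pandas, the same pattern and the `convert` callable could be passed to `Series.str.replace` with `regex=True`.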

Extracting minimum salary and maximum salary from the range

In [123]:
def find_min(x):
    #returns the part of the range before the hyphen, or NaN for missing salaries
    if pd.isna(x):
        return np.nan
    else:
        return ''.join(re.findall(r'(\d+.*)-',x))

In [124]:
df['min_salary']=df['salary'].apply(lambda x: find_min(x))

In [125]:
df[df['salary'].notna()][['salary','min_salary']]

| | salary | min_salary |
|---|---|---|
| 33 | 10-20 | 10 |
| 46 | 1-3 | 1 |
| 47 | 4.5-5.5 | 4.5 |
| 51 | 18-25 | 18 |
| 70 | 5-10 | 5 |
| ... | ... | ... |
| 2072 | 2.75-5 | 2.75 |
| 2098 | 22.5-25 | 22.5 |
| 2114 | 35-40 | 35 |
| 2115 | 4-8 | 4 |


In [134]:
def find_max(x):
    #returns the part of the range after the hyphen, or NaN for missing salaries
    if pd.isna(x):
        return np.nan
    else:
        return ''.join(re.findall(r'-(.*)',x))

In [135]:
df['max_salary'] = df['salary'].apply(lambda x: find_max(x))

In [136]:
df['max_salary'].unique()

array([nan, '20', '3', '5.5', '25', '10', '22', '22.5', '9', '24', '14',
       '15', '45', '30', '40', '2.', '17', '18', '60', '3.75', '4', '21',
       '5', '35', '4.5', '8', '4.25', '13', '7', '65', '12', '16', '34',
       '17.5', '', '27.5', '32.5', '42.5', '6', '4.75', '1', '11', '9.5',
       '2.5', '70'], dtype=object)

In [139]:
df[df['salary'].notna()][['salary','min_salary','max_salary']]

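| | salary | min_salary | max_salary |
|---|---|---|---|
| 33 | 10-20 | 10 | 20 |
| 46 | 1-3 | 1 | 3 |
| 47 | 4.5-5.5 | 4.5 | 5.5 |
| 51 | 18-25 | 18 | 25 |
| 70 | 5-10 | 5 | 10 |
| ... | ... | ... | ... |
| 2072 | 2.75-5 | 2.75 | 5 |
| 2098 | 22.5-25 | 22.5 | 25 |
| 2114 | 35-40 | 35 | 40 |
| 2115 | 4-8 | 4 | 8 |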


Converting the data type of the minimum and maximum salary from string to float

In [151]:
#empty strings (salaries without a range) cannot be cast to float, so mark them as missing
df['min_salary'] = df['min_salary'].replace('',np.nan)
df['max_salary'] = df['max_salary'].replace('',np.nan)

In [152]:
df['min_salary'].unique()

array([nan, '10', '1', '4.5', '18', '5', '12', '7', '14', '7.5', '30',
       '20', '25', '8', '15', '35', '2.5', '2', '9', '11', '4', '1.5',
       '3.5', '3', '50', '6', '0', '6.5', '9.5', '13', '27.5', '0.5',
       '3.25', '22.5', '0.8', '5.5', '2.75', '37.5', '1.75', '60'],
      dtype=object)

In [153]:
df[['min_salary','max_salary']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2172 entries, 0 to 2171
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   min_salary  176 non-null    object
 1   max_salary  176 non-null    object
dtypes: object(2)
memory usage: 34.1+ KB


In [157]:
df[['min_salary','max_salary']] = df[['min_salary','max_salary']].astype('float')

In [158]:
df[['min_salary','max_salary']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2172 entries, 0 to 2171
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   min_salary  176 non-null    float64
 1   max_salary  176 non-null    float64
dtypes: float64(2)
memory usage: 34.1 KB
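
As an alternative sketch, the two steps (replacing empty strings, then casting) can be combined with `pd.to_numeric` and `errors="coerce"`, which turns anything that is not a number into NaN. The sample series below is made up for illustration.

```python
import pandas as pd

# Made-up values: numeric strings, an empty string and a missing entry.
raw = pd.Series(["10", "", None, "4.5"])

# errors="coerce" converts unparseable entries to NaN instead of raising.
converted = pd.to_numeric(raw, errors="coerce")
print(converted)
```

The trade-off is that coercion hides unexpected values, so it is worth inspecting `unique()` on the column first, as done above.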


Dropping the salary column, since its information is now held in `min_salary` and `max_salary`

In [141]:
df=df.drop('salary',axis=1)

In [159]:
df.head()

| | title | company | ratings | reviews | years | location | days | key_skills | min_salary | max_salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | R&D- Data Science and Analytics Lead | Pepsi Foods | 4.1 | 2308.0 | 2-5 | Hyderabad | 2.0 | Machine learning,big data | NaN | NaN |
| 1 | Specialist- Data Science & Analytics | Carrier | 3.8 | 419.0 | 2-5 | Bengaluru | 2.0 | ai,data science,Neural networks,Artificial Int... | NaN | NaN |
| 2 | Data Science Analytics Analyst | Accenture | 4.0 | 49777.0 | 3-5 | Bengaluru | 3.0 | data mining,machine learning,excel,python,data... | NaN | NaN |
| 3 | Data Science & Analytics Engagement Lead | CANPACK | 4.3 | 236.0 | 0-15 | Pune | 1.0 | sql,data science,data mining,power bi,machine ... | NaN | NaN |
| 4 | Assoc Manager, Data Science Analytics | NaN | NaN | NaN | 3-6 | Bengaluru | 3.0 | excel,python,data mining | NaN | NaN |


####  <u>**Fixing year column**</u>