### Importing the neccessary libraries.

in this section, we are going to import all necessary libraries for this project. the most notable libraries here apart from 
pandas and numpy is selenium which is a very good automation library, and which we will utilise it to scrape the website we are interested in.

In [39]:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
import pandas as pd
import plotly.express as px
import seaborn as sns
import matplotlib.pyplot as plt

### Automating the website we want to scrape using selenium.


Here, we have automated a visit to the mentioned website and then extracted all of its HTML contens. the scrolling method of
the website is dynamic which means it scrolls down automatically when the scroller reachs a certain position of the page, having that in mind,
we have given it a 5 seconds pause to refresh itself and scroll down untill there is no more pages left. that is when we tell it to quit and save the content to our variable. 

The result is a varialable that holds the entire HTML content of the jobs section of the page, which is what we are looking for obviously.

In [40]:

# Set up the browser and navigate to the page
driver = webdriver.Chrome()
driver.get("https://somalijobs.com/jobs")

# Define the pause time (in seconds) between scrolls
SCROLL_PAUSE_TIME = 5

# Get the height of the initial page
last_height = driver.execute_script("return document.body.scrollHeight")

# Scroll down the page until the end
while True:
    # Scroll down to the bottom of the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

    # Wait to load the page
    time.sleep(SCROLL_PAUSE_TIME)

    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Extract the HTML content of the page
html = driver.page_source

# Parse the HTML content with BeautifulSoup
#soup = BeautifulSoup(html, "html.parser")

# Extract the data you need from the parsed HTML content
# ...

# Close the browser window
driver.quit()

### Using BeautifulSoup to extract the html elements we are interested in, collecting the data we need and saving it as a pandas Dataframe.

In [41]:
#create a beutifulsoup instance
soup = BeautifulSoup(html, 'html.parser')

# find all html divs with class name of 'jobs-container'
all_page = soup.find('div', id= 'jobs-container')

# find all sub-divs in the former div with class name of 'jmg-right'
jobs= all_page.find_all('div', class_= 'jmg-right')

#initialize empty lists for latter use.
titles = []
company_names = []
positions = []
overviews = []
departments = []
locations = []
Dates = [] 

# a for loob that iterates over the jobs variable, looks for specific content and saves it to appropriate variables, 
# and at last appends those variables to there respective lists.

for job in jobs:
    title = job.find("h4", class_="jmg-title" ).text.strip()
    company_name = job.find('h4', class_ = "jmg-company-title").text.strip()
    position = job.find('span', class_ = 'skl-6').text[22:]
    overview = job.find('div', class_ = 'job-listing-1-overview').text.strip()
    department = job.find('span', class_ = 'skl-2').text.strip()
    location = job.find('span', class_ = 'skl-3').text.strip()
    

    titles.append(title)
    company_names.append(company_name)
    positions.append(position)
    overviews.append(overview)
    departments.append(department)
    locations.append(location)
    #Dates.append(date)

    # we finally create a dataframe holding the data have got
    Jobs= pd.DataFrame({'Title': titles, 'Company Name': company_names, 'Position': positions, 'Overview': overviews, 'Departments': departments, 'Location': locations})


### The Dataframe we created.

Below is the result of our scrapper; the data that we have successfully scrapped from the somalijobs website, as a pandas dataframe

In [94]:
Jobs.head()

Unnamed: 0,Title,Company Name,Position,Overview,Departments,Location
0,"Summary of job opportunities, Tenders/contract...",Somalijobs Inc,Full Time,#EDC is hiring Chargé(e) de suivi et évaluatio...,Human Resource And Administration,Multiple Locations
1,Chargé(e) de suivi et évaluation,Education Development Center (EDC),Full Time,Education Development Center (EDC) est une org...,Monitoring And Evaluation,Djibouti
2,Assistant(e) de formation,Education Development Center (EDC),Full Time,Description de l'entreprise\nEducation Develop...,Social Affairs,Djibouti
3,Project Officer - Food Accounting Information ...,World Vision International (WVI),Full Time,Employee Contract Type:\nLocal - Fixed Term Em...,Project Management/ Assistant,Garowe
4,Camp Coordination and Camp Management Manager,Danish Refugee Council (DRC),Full Time,DRC Somalia/Land\nDRC has been operating in So...,Management/leadership,Beledweyne


### Cleaning the data from the unwanted strings. 

As we can see from our initial unsanatized dataframe, there some data cleaning required to make the dataframe ready for analysis,
let us deal with it.

In [43]:
Jobs["Departments"] = Jobs["Departments"].str.strip('  border_clear \n ')
Jobs["Position"] = Jobs["Position"].str.strip('  b\n               ... ')
Jobs["Location"]=  Jobs["Location"].str.strip('     room \n                        ')


In [44]:
Jobs.to_csv('jobs-10-2023.csv', index=False)

This is the new look of our data, it is almost clean and ready for analysis

In [58]:
Jobs.head()

Unnamed: 0,Title,Company Name,Position,Overview,Departments,Location
0,"Summary of job opportunities, Tenders/contract...",Somalijobs Inc,Full Time,#EDC is hiring Chargé(e) de suivi et évaluatio...,Human Resource And Administration,Multiple Locations
1,Chargé(e) de suivi et évaluation,Education Development Center (EDC),Full Time,Education Development Center (EDC) est une org...,Monitoring And Evaluation,Djibouti
2,Assistant(e) de formation,Education Development Center (EDC),Full Time,Description de l'entreprise\nEducation Develop...,Social Affairs,Djibouti
3,Project Officer - Food Accounting Information ...,World Vision International (WVI),Full Time,Employee Contract Type:\nLocal - Fixed Term Em...,Project Management/ Assistant,Garowe
4,Camp Coordination and Camp Management Manager,Danish Refugee Council (DRC),Full Time,DRC Somalia/Land\nDRC has been operating in So...,Management/leadership,Beledweyne


#### Our new Dataframe.

Now that we have saved our clean dataframe as a csv file, lets read it and continue our analysis and build our visualisations 

In [66]:
# Read the newly saved csv file
jobs_data = pd.read_csv('jobs-10-2023.csv')


In [64]:
jobs_data.head()

Unnamed: 0,Title,Company Name,Position,Overview,Departments,Location
0,"Summary of job opportunities, Tenders/contract...",Somalijobs Recruitment Service,Full Time,#FederalGovernmentOfSomalia is hiring IVTC Cen...,Human Resource And Administration,Multiple Locations
1,IVTC Centre Manager,Federal Government Of Somalia,Consultant,FEDERAL REPUBLIC OF SOMALIA\n \nMINISTRY OF LA...,Consultancies,Mogadishu
2,Re-Advert: Chief Cashier,Shabelle Bank,Full Time,Shabelle Bank is a fully licensed Commercial B...,Finance And Accounting,Gode
3,Customer Care and Finance Officer,Hanaan Travel and Tour Agency,Full Time,About Hanaan Travel and Tour Agency:\n \nHanaa...,Finance And Accounting,Garowe
4,Project Officer,Secours Islamique France (SIF),Full Time,VACANCY ADVERTISMENT\nJOB TITLE: ...,Project Management/ Assistant,Kismay


### Visualisations

Let us create some charts conclude some interesting insights from them.
we will use the  plotly library which has some beautifull and interactive charts.

#### 1- The Percentage each city recieved from the total of all jobs offered.

In [89]:
jobs_data['Location'].value_counts().to_frame().head()

Unnamed: 0,Location
Mogadishu,39
Hargeisa,16
Nairobi,13
Somalia,9
Djibouti,9


In [97]:
#Create a new series from Location column of our dataframe
locations = jobs_data['Location'].value_counts()
# Turn it to a Dataframe- i will need some dataframe methods
locations = locations.to_frame()

# Create a new column 
locations["Percentage"] = locations.Location/locations.Location.sum()*100
# Round it
locations["Percentage"]   = locations.Percentage.round().astype(str) + " %"

# Create a barchar using plotly
fig = px.bar(
             locations, x = locations.index , y ='Location',  
             title = " The Percentage each city recieved from the total of all jobs offered.",
             height=600,
             width = 1000,
             color = "Location",
            template='plotly_dark',
             text = locations.Percentage,
             labels={
                     "Location": "Count"
                   
                 },
)
    
fig.update_layout( xaxis={'categoryorder': 'total ascending'})

fig.show()

In this chart, the emphasis was to quantify the number of jobs advertised in each city/location in the dataset and it is percetage from the total number of jobs.

in the gragh, there is a couple of notable outliers in terms of the number of jobs they recieved out of the total.
for example, Mogadishu, the capital has recieved more than a quarter of the jobs inlisted in the website, which is more than double the jobs advertised in the second position city/location,
which is by the way the second largest city of the country.
in a nutshell; the two largest cities in the country recieved 37% of all jobs inlisted in the dataset, leaving the other 16 major cities to devide the rest 63%.

There is also some data points which are understandably ambigeous to interpreter. for example, Somalia recieved 6% of all jobs, which makes it hard to interpret, because we are interested in looking for individual cities with in somalia for our analysis.

#### 2-The Percentage of each job title from the total number of job titles offered.

In [101]:
#Create a new series from Location column of our dataframe
departments = jobs_data['Departments'].value_counts()
# Turn it to a Dataframe- i will need some dataframe methods
departments = departments.to_frame()

# Create a new column 
departments["Percentage"] = departments.Departments/departments.Departments.sum()*100
# Round it
departments["Percentage"]   = departments.Percentage.round().astype(str) + " %"

# Create a barchar using plotly
fig = px.bar(
             departments, x = departments.index , y ='Departments',  
             title = "The Percentage of each job title from the total number of job titles offered." , 
             height=800,
             width = 1100,
             color = 'Departments',
             template='plotly_dark',
             text = departments.Percentage,
             labels={
                     "Percentage": "Count"
                   
                 },
            )
    
fig.update_layout( xaxis={'categoryorder': 'total ascending'})
fig.show()

Couple of points from the chart:

   1- Finance and Accounting positions where the most frequent job titles in the dataset, each accounting to 13%.
   
   2- also Consultant roles and Project Management were also very popular in the data, each accomulating 9%, 7% respectively.
   
   3- interestingly, Technology related roles are absent from the data, which came as a suprise and disappointing to me as i was expecting a different result.

#### 3- Percentage of each Position offered out of the total number of all positons.

In [103]:
#Create a new series from Location column of our dataframe
Positions = jobs_data['Position'].value_counts()
# Turn it to a Dataframe- i will need some dataframe methods
Positions = Positions.to_frame()

# Create a new column 
Positions["Percentage"] = Positions.Position/Positions.Position.sum()*100
# Round it
Positions["Percentage"]   = Positions.Percentage.round().astype(str) + " %"

# Create a barchar using plotly
fig = px.bar(
             Positions, x = Positions.index , y ='Position',  
             title = "Percentage of each Position offered out of the total number of all positons." , 
             height=600,
             width = 800,
             color = 'Position',
             template='plotly_dark',
             text = Positions.Percentage,
             labels={
                     "Percentage": "Count"
                   
                 },
            )
    
fig.update_layout( xaxis={'categoryorder': 'total ascending'})
fig.show()

Obvious from the chart, Fulltime positions are almost 80% of the total positions, Consultant Positions taking 19%. Interships and Part Time are accounting less than 3%.

In [72]:
(jobs_data['Departments'].eq('Project Manager')).any()

False

In [73]:
jobs_data["Company Name"].value_counts()

UNHCR                                             12
Save The Childrens International                  11
World Food Programme (WFP)                        10
Federal Government Of Somalia                      8
United Nations                                     7
                                                  ..
SHACDO                                             1
Hill Development Inc                               1
International Organization For Migration (IOM)     1
Impact Initiatives                                 1
Halo Trust                                         1
Name: Company Name, Length: 63, dtype: int64