<a href="https://colab.research.google.com/github/Priyankaverma2024/job-recommendation-system/blob/main/Copy_of_Project_8_Job_Reccomendation_system.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.express as px
import plotly.graph_objects as go
from datetime import datetime

In [None]:
# Mount Google Drive (drive.mount creates the mount point itself)
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Load the dataset
df = pd.read_csv('/content/drive/MyDrive/all_upwork_jobs.xlsx.csv')


In [None]:
df.head()

Unnamed: 0,title,link,published_date,is_hourly,hourly_low,hourly_high,budget,country
0,Experienced Media Buyer For Solar Pannel and R...,https://www.upwork.com/jobs/Experienced-Media-...,2024-02-17 09:09:54+00:00,False,,,500.0,
1,Full Stack Developer,https://www.upwork.com/jobs/Full-Stack-Develop...,2024-02-17 09:09:17+00:00,False,,,1100.0,United States
2,SMMA Bubble App,https://www.upwork.com/jobs/SMMA-Bubble-App_%7...,2024-02-17 09:08:46+00:00,True,10.0,30.0,,United States
3,Talent Hunter Specialized in Marketing,https://www.upwork.com/jobs/Talent-Hunter-Spec...,2024-02-17 09:08:08+00:00,True,,,,United States
4,Data Engineer,https://www.upwork.com/jobs/Data-Engineer_%7E0...,2024-02-17 09:07:42+00:00,False,,,650.0,India


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244828 entries, 0 to 244827
Data columns (total 8 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   title           244827 non-null  object 
 1   link            244827 non-null  object 
 2   published_date  244828 non-null  object 
 3   is_hourly       244828 non-null  bool   
 4   hourly_low      102422 non-null  float64
 5   hourly_high     98775 non-null   float64
 6   budget          103891 non-null  float64
 7   country         239751 non-null  object 
dtypes: bool(1), float64(3), object(4)
memory usage: 13.3+ MB


In [None]:
df.isnull().sum()

Unnamed: 0,0
title,1
link,1
published_date,0
is_hourly,0
hourly_low,142406
hourly_high,146053
budget,140937
country,5077


Data Cleaning & Preprocessing

In [None]:
## Drop unnecessary columns
df.drop(columns=['link'], inplace=True)

In [None]:
## Convert 'published_date' to datetime format
df['published_date'] = pd.to_datetime(df['published_date'], errors='coerce')



In [None]:
## Fill missing numerical values with median
df['hourly_low'] = df['hourly_low'].fillna(df['hourly_low'].median())
df['hourly_high'] = df['hourly_high'].fillna(df['hourly_high'].median())
df['budget'] = df['budget'].fillna(df['budget'].median())

In [None]:
## Fill missing values of 'country' and 'title' with "Unknown"
df['country'] = df['country'].fillna("Unknown")
df['title'] = df['title'].fillna("Unknown")

In [None]:
# Average of the hourly range, then one "salary" value per posting:
# the hourly average for hourly jobs, the fixed budget otherwise
df["hourly_avg"] = df[["hourly_low", "hourly_high"]].mean(axis=1)
df["salary"] = np.where(df["is_hourly"], df["hourly_avg"], df["budget"])
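
The salary rule above can be checked on a few rows before running it on the full dataset. The sketch below uses made-up values and adds a hypothetical `pay_type` column (not part of the original notebook) so that hourly rates and fixed budgets, which are different units, can be analyzed separately later.

```python
import numpy as np
import pandas as pd

# Toy rows (hypothetical values) mirroring the notebook's columns.
toy = pd.DataFrame({
    "is_hourly":   [True, False, True],
    "hourly_low":  [10.0, np.nan, 20.0],
    "hourly_high": [30.0, np.nan, 40.0],
    "budget":      [np.nan, 500.0, np.nan],
})

# Row-wise mean of the hourly range; NaN when both bounds are missing.
toy["hourly_avg"] = toy[["hourly_low", "hourly_high"]].mean(axis=1)

# Hourly jobs take the hourly average, fixed-price jobs take the budget.
toy["salary"] = np.where(toy["is_hourly"], toy["hourly_avg"], toy["budget"])

# Hypothetical helper column: keep the pay type next to the value.
toy["pay_type"] = np.where(toy["is_hourly"], "hourly", "fixed")

print(toy[["salary", "pay_type"]])
```

Grouping by `pay_type` before averaging would keep a 500 fixed budget from being averaged with a 20-per-hour rate.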


In [None]:
# Keep only necessary columns and drop missing values
df_cleaned = df[["title", "salary", "published_date", "country"]].dropna().reset_index(drop=True)


In [None]:
import re

In [None]:
# Improved Function to Extract Meaningful Keywords
def extract_keywords(title):
    if isinstance(title, str) and title.strip():
        # The title is lowercased before matching, so every alternative must be lowercase (hence "ai", not "AI")
        words = re.findall(r"\b(?:developer|engineer|manager|analyst|designer|consultant|specialist|lead|architect|scientist|administrator|ai|blockchain|cybersecurity|cloud|data|machine learning|remote)\b", title.lower())
        return words if words else None
    return None
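
Because the function lowercases the title before matching, an uppercase alternative in the pattern can never match. A minimal check on one hypothetical title, using shortened patterns that are illustrative only:

```python
import re

# Shortened, hypothetical versions of the keyword pattern.
PATTERN_UPPER = r"\b(?:engineer|AI|data)\b"  # "AI" cannot match lowercased text
PATTERN_LOWER = r"\b(?:engineer|ai|data)\b"

title = "AI Engineer for Data Pipeline"

upper_hits = re.findall(PATTERN_UPPER, title.lower())
lower_hits = re.findall(PATTERN_LOWER, title.lower())

print(upper_hits)  # ['engineer', 'data']
print(lower_hits)  # ['ai', 'engineer', 'data']
```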



In [None]:
df_cleaned['keywords'] = df_cleaned['title'].apply(extract_keywords)

In [None]:
# Remove rows where keyword extraction failed
df_cleaned = df_cleaned.dropna(subset=["keywords"]).reset_index(drop=True)


In [None]:
# Analyze Correlation Between Job Title Keywords and Salaries
keyword_salary_df = df_cleaned.explode("keywords").groupby("keywords")["salary"].mean().reset_index()
fig1 = px.bar(keyword_salary_df, x="keywords", y="salary", title="Average Salary by Job Title Keyword", labels={"salary": "Average Salary", "keywords": "Job Keyword"})
fig1.show()



# Key Insights:

1. Highest-Paid Job Keywords:

* The keywords "architect" and "blockchain" have the highest average salaries, exceeding 1600. This suggests that postings for these roles or skills carry the largest pay values in the dataset.

2. Moderately High-Paid Job Keywords:

* Keywords like "machine learning," "consultant," "engineer," and "administrator" show higher average salaries than most other keywords, indicating that these roles are also well compensated.

3. Lower-Paid Job Keywords:

* Keywords such as "scientist," "data," and "lead" have the lowest average salaries, suggesting that these roles are either entry-level or lower paid than others in the dataset.

4. Varied Salary Distribution:

* Average salaries vary significantly by job keyword. Some technical and specialized roles (e.g., architect, blockchain) command higher salaries, while more general keywords (e.g., data, scientist) have lower averages. Note that "salary" here is the hourly average for hourly postings and the fixed budget otherwise, so each keyword's average mixes the two pay types.

5. Emerging Trends:

* "remote" is one of the extracted keywords, and its position in the chart indicates how remote roles are paid relative to others.

6. Business Implications:

* Job Seekers: Professionals can improve their earning potential by upskilling in high-paying areas such as blockchain, architecture, and machine learning.

* Employers & Recruiters: Companies looking for specialized talent in high-paying roles must be ready to offer competitive salaries.

* Career Planning: Individuals entering the job market should consider focusing on skills that align with higher-paying job keywords.
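
A mean per keyword can be dominated by a few large values, so it can help to look at the median and the number of postings alongside it. The sketch below uses made-up numbers in the same shape as the exploded keyword frame; the values are illustrative, not taken from the dataset.

```python
import pandas as pd

# Hypothetical exploded frame: one row per (posting, keyword) pair.
toy = pd.DataFrame({
    "keywords": ["architect", "architect", "architect", "data", "data"],
    "salary":   [50.0, 60.0, 5000.0, 40.0, 60.0],
})

# Mean, median, and posting count per keyword.
summary = (
    toy.groupby("keywords")["salary"]
       .agg(mean="mean", median="median", count="count")
       .reset_index()
)
print(summary)
```

In this toy data a single 5000 value lifts the "architect" mean far above its median of 60, which is the kind of gap worth checking before ranking keywords by average.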


In [None]:
# Identify Emerging Job Categories Based on Posting Frequency
df_trend = df_cleaned.explode("keywords").groupby(["keywords", pd.Grouper(key="published_date", freq="ME")]).size().reset_index(name="count")
fig2 = px.line(df_trend, x="published_date", y="count", color="keywords", title="Emerging Job Categories Over Time")
fig2.show()


# Key Insights:

1. Rapid Growth in Certain Job Categories:

* The keywords with the steepest upward trend indicate high demand and rapid growth.

* The fastest-emerging job categories appear to be cloud, blockchain, and machine learning, which show the largest increase in job postings.

2. Consistent Increase Across Multiple Categories:

* Job roles like analyst, engineer, consultant, and developer are also growing steadily, indicating consistent demand over time.

* This suggests that these roles remain essential in the job market, and their demand is expected to continue.

3. Early 2024 Job Market Expansion:

* Most job postings started increasing significantly around early February 2024.

* This could be linked to companies ramping up hiring after the new year, seasonal hiring trends, or industry shifts.

4. Remote Jobs are Emerging:

* The keyword "remote" indicates a rise in remote job opportunities, highlighting the continuing shift towards flexible work arrangements.

5. Lower Growth in Certain Keywords:

* Some job categories, like scientist, lead, and cybersecurity, have a relatively flatter curve, indicating slower growth in postings than other roles.
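
"Steeper" and "flatter" curves can be put into numbers with a month-over-month growth rate per keyword. This is a minimal sketch on made-up postings; it groups by calendar month with `dt.to_period("M")` rather than the notebook's `pd.Grouper`, which is a substitution made here for brevity.

```python
import pandas as pd

# Hypothetical postings, one row per (posting, keyword) pair.
toy = pd.DataFrame({
    "keywords": ["cloud", "cloud", "cloud", "lead", "lead", "lead", "lead"],
    "published_date": pd.to_datetime([
        "2024-01-10", "2024-02-05", "2024-02-20",
        "2024-01-03", "2024-01-15", "2024-02-01", "2024-02-11",
    ]),
})

# Count postings per keyword and calendar month, one column per keyword.
monthly = (
    toy.assign(month=toy["published_date"].dt.to_period("M"))
       .groupby(["keywords", "month"]).size()
       .unstack("keywords", fill_value=0)
)

# Relative change from one month to the next.
growth = monthly.pct_change().dropna()
print(growth)
```

Here "cloud" doubles (growth of 1.0) while "lead" stays flat (0.0), even though "lead" has more postings overall.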

In [None]:
from scipy.stats import pearsonr
from statsmodels.tsa.holtwinters import ExponentialSmoothing
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, LSTM


In [None]:
# Predict High-Demand Job Roles Based on Posting Trends
demand_trend = df_cleaned.groupby(pd.Grouper(key="published_date", freq="ME")).size().reset_index(name="count")

# Check if enough data points are available for seasonal estimation
if len(demand_trend) < 2 * 12:  # 2 full seasonal cycles
    # If not enough data, either gather more data or disable seasonal component
    model_trend = ExponentialSmoothing(demand_trend["count"], trend="add", seasonal=None).fit() # Disable seasonal component
    print("Warning: Not enough data for seasonal estimation. Seasonal component disabled.")
else:
    model_trend = ExponentialSmoothing(demand_trend["count"], trend="add", seasonal="add", seasonal_periods=12).fit()

predictions = model_trend.forecast(12)
fig3 = px.line(x=list(range(len(demand_trend))) + list(range(len(demand_trend), len(demand_trend) + 12)),
               y=list(demand_trend["count"]) + list(predictions),
               title="Forecasted High-Demand Job Roles Over Time",
               labels={"x": "Time (Months)", "y": "Job Postings"})
fig3.show()




Optimization failed to converge. Check mle_retvals.



# Insights from the Forecasted High-Demand Job Roles Over Time Graph

1. Upward Trend in Job Postings:

* The forecast covers total monthly job postings (not individual roles) and shows a steady increase over time, indicating that overall demand is expected to grow.

* This suggests a positive job market outlook, where more positions are likely to become available.

2. Linear Growth Pattern:

* The forecast is linear, meaning job postings increase at a constant rate rather than in sudden spikes.

* This is expected from the model: with an additive trend and no seasonal component, the forecast extends the last estimated level and trend as a straight line, so it cannot show seasonal or short-term fluctuations.

3. Warning Messages & Model Limitations:

* The code prints a warning when fewer than 24 monthly data points are available, in which case the seasonal component is disabled.

* As a result, the forecast does not capture any possible seasonal variations (e.g., hiring booms in certain months).

* Additionally, the warning "Optimization failed to converge" means the model's parameters may not be fully optimized, which can reduce forecast accuracy.

4. Implications for Job Seekers & Employers:

* For job seekers: The increasing trend suggests more opportunities, making it a good time to upskill and prepare for high-demand roles.

* For employers: Companies should anticipate a competitive hiring landscape, meaning they may need to enhance their recruitment strategies to attract top talent.
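
The linear shape of the forecast follows from the model form rather than from the data. The sketch below hand-codes Holt's additive-trend smoothing in NumPy with fixed, hypothetical smoothing parameters (`alpha`, `beta`) and made-up counts; statsmodels estimates these parameters by optimization, so this illustrates the recursion and is not a reimplementation of the fitted model.

```python
import numpy as np

def holt_linear_forecast(y, alpha=0.8, beta=0.2, horizon=12):
    """Holt's additive-trend smoothing with fixed, hypothetical parameters."""
    level, trend = y[0], y[1] - y[0]  # simple initialisation
    for value in y[1:]:
        prev_level = level
        level = alpha * value + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The h-step forecast is level + h * trend: a straight line in h.
    return np.array([level + h * trend for h in range(1, horizon + 1)])

counts = np.array([5.0, 40.0, 900.0, 30000.0])  # made-up monthly counts
forecast = holt_linear_forecast(counts)
steps = np.diff(forecast)

print(np.allclose(steps, steps[0]))  # True: every forecast step is the same size
```

Whatever the input series looks like, the 12 forecast points lie on one line, so a linear forecast says little about whether real growth is linear.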

In [None]:
# Compare Average Hourly Rates Across Different Countries
df_country_salary = df_cleaned.groupby("country")["salary"].mean().reset_index()
fig4 = px.choropleth(df_country_salary, locations="country", locationmode="country names",
                     color="salary", hover_name="country",
                     color_continuous_scale="Viridis", title="Average Hourly Rates by Country")
fig4.show()


# Insights from the Choropleth Map (Average Hourly Rates by Country)

1. Variation in Salaries Across Countries

* The map shows significant differences in average hourly salaries across different regions.

* Some countries have higher salaries (yellow-green shades), while others have lower salaries (dark purple shades).

2. High Salary Regions

* Countries with high average hourly wages are marked in yellow-green, likely representing regions such as North America, Western Europe, and Australia.

* These areas typically have strong economies, higher living costs, and developed job markets.

3. Low Salary Regions

* Many countries, especially in Africa, South America, and parts of Asia, have lower hourly wages (dark purple shades).

* These regions may have lower labor costs, different economic structures, or lower demand for high-paying roles.

4. Possible Data Gaps

* Some areas appear white or light-colored, which might indicate missing data or insufficient job postings from those countries.

* It's essential to check whether the dataset covers all regions equally to avoid bias.
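
One way to act on the data-gap concern is to report the number of postings behind each country's average and to drop countries below a minimum count. The threshold `MIN_POSTINGS` and the values below are hypothetical.

```python
import pandas as pd

# Made-up postings: two well-covered countries and one with a single posting.
toy = pd.DataFrame({
    "country": ["United States"] * 4 + ["India"] * 3 + ["Iceland"],
    "salary":  [30.0, 40.0, 50.0, 40.0, 15.0, 20.0, 25.0, 900.0],
})

MIN_POSTINGS = 3  # hypothetical threshold for a usable average

by_country = (
    toy.groupby("country")["salary"]
       .agg(mean_salary="mean", postings="count")
       .reset_index()
)

# Keep only countries with enough postings to average over.
reliable = by_country[by_country["postings"] >= MIN_POSTINGS]
print(reliable)
```

Without the filter, the single 900 posting would make "Iceland" the brightest country on the map.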

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [None]:
# Prepare Data for Job Recommendation Model
titles = df_cleaned["title"].values
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(titles)
sequences = tokenizer.texts_to_sequences(titles)
padded_sequences = pad_sequences(sequences, maxlen=20)
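
With its default arguments, `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, keeps the last `maxlen` tokens. The NumPy function below is a stand-in written for illustration (it is not the Keras implementation) that reproduces this default behavior on two short made-up sequences.

```python
import numpy as np

def pad_pre(sequences, maxlen):
    """NumPy sketch of the pad_sequences defaults: pad and truncate at the front."""
    out = np.zeros((len(sequences), maxlen), dtype="int32")
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty title stays all zeros
        trimmed = seq[-maxlen:]           # keep the last maxlen tokens
        out[i, -len(trimmed):] = trimmed  # left-pad with zeros
    return out

padded = pad_pre([[7, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(padded.tolist())  # [[0, 0, 7, 3], [3, 4, 5, 6]]
```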


In [None]:
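# Add a trailing axis: (num_samples, maxlen) -> (num_samples, maxlen, 1).
# The model's Input layer is declared with shape (maxlen,) and feeds an Embedding layer,
# which takes 2D integer input, so this reshape is not required by the architecture below.
padded_sequences = padded_sequences.reshape(padded_sequences.shape[0], padded_sequences.shape[1], 1)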

In [None]:
print(padded_sequences.shape)

(49121, 20, 1)


In [None]:
padded_sequences = np.array(padded_sequences)  # Convert to NumPy array if needed

In [None]:
from tensorflow.keras.layers import Input

In [None]:
model = Sequential([
    Input(shape=(padded_sequences.shape[1],)),
    Embedding(input_dim=5000, output_dim=64),
    LSTM(64, return_sequences=True),
    LSTM(32),
    Dense(32, activation='relu'),
    Dense(len(set(df_cleaned["title"])), activation='softmax')
])

In [None]:
# model.summary() prints the layer table itself and returns None, so it should not be wrapped in print()
model.summary()



In [None]:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
# Convert job titles to category indexes
df_cleaned["title_index"] = df_cleaned["title"].astype('category').cat.codes
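
The category codes are the class labels the model is trained on, and `cat.categories` is the lookup table that maps a code back to a title. A small example with made-up titles:

```python
import pandas as pd

titles = pd.Series(["QA Engineer", "Data Engineer", "QA Engineer", "Web Designer"])
as_cat = titles.astype("category")

codes = as_cat.cat.codes            # integer label per row
categories = as_cat.cat.categories  # sorted unique titles; position == code

print(codes.tolist())    # [1, 0, 1, 2]
print(list(categories))  # ['Data Engineer', 'QA Engineer', 'Web Designer']
```

Each distinct title string becomes its own class, so the number of output units equals the number of unique titles.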


In [None]:
# Train the model
model.fit(padded_sequences, df_cleaned["title_index"].values, epochs=10, batch_size=32, validation_split=0.2)


Epoch 1/10
1228/1228 ━━━━━━━━━━━━━━━━━━━━ 20s 13ms/step - accuracy: 0.0078 - loss: 10.3317 - val_accuracy: 0.0070 - val_loss: 10.4522
Epoch 2/10
1228/1228 ━━━━━━━━━━━━━━━━━━━━ 19s 13ms/step - accuracy: 0.0124 - loss: 9.6307 - val_accuracy: 0.0193 - val_loss: 10.9389
Epoch 3/10
1228/1228 ━━━━━━━━━━━━━━━━━━━━ 20s 13ms/step - accuracy: 0.0281 - loss: 9.1396 - val_accuracy: 0.0271 - val_loss: 11.5881
Epoch 4/10
1228/1228 ━━━━━━━━━━━━━━━━━━━━ 18s 15ms/step - accuracy: 0.0376 - loss: 8.7512 - val_accuracy: 0.0357 - val_loss: 12.2080
Epoch 5/10
1228/1228 ━━━━━━━━━━━━━━━━━━━━ 18s 13ms/step - accuracy: 0.0518 - loss: 8.3013 - val_accuracy: 0.0475 - val_loss: 12.8567
Epoch 6/10
1228/1228 ━━━━━━━━━━━━━━━━━━━━ 23s 15ms/step - accuracy: 0.0656 - loss: 7.8184 - val_accuracy: 0.0539 - val_loss: 13.56

<keras.src.callbacks.history.History at 0x78d7c04a51d0>
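
In the log above, training loss falls each epoch while validation loss rises (from 10.4522 to 13.56 in the last line shown), a pattern consistent with overfitting. One possible contributor, offered here as an assumption and not verified on the dataset, is that every distinct title is its own class, so many classes may have a single example. The sketch below shows how to measure that share on made-up titles.

```python
import pandas as pd

# Made-up titles standing in for df_cleaned["title"].
titles = pd.Series([
    "Full Stack Developer", "Full Stack Developer", "Data Engineer",
    "SMMA Bubble App", "Logo Designer", "Data Engineer", "Video Editor",
])

counts = titles.value_counts()
n_classes = counts.size                 # number of distinct titles
singletons = int((counts == 1).sum())   # titles that occur exactly once

print(n_classes, singletons)
print(f"{singletons / n_classes:.0%} of classes have one example")
```

A class with one example lands either in the training split or in the validation split, never both, so the model cannot be both trained and validated on it.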

In [None]:
# Function for Job Recommendation
def recommend_jobs(input_text, top_n=5):
    seq = tokenizer.texts_to_sequences([input_text])
    padded = pad_sequences(seq, maxlen=20)
    predictions = model.predict(padded)[0]
    top_indices = np.argsort(predictions)[-top_n:][::-1]
    recommended_jobs = df_cleaned["title"].astype('category').cat.categories[top_indices]
    return recommended_jobs


In [None]:
# Example Usage
print(recommend_jobs("Software Engineer"))


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
Index(['QA Engineer', 'Software Engineer', 'Software Architect',
       'Wordpress Designer', 'DevOps Engineer'],
      dtype='object')
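
The top-N selection inside `recommend_jobs` relies on `np.argsort`, which sorts in ascending order; taking the last `top_n` indices and reversing them yields the highest-scoring classes first. A small check with made-up scores:

```python
import numpy as np

predictions = np.array([0.05, 0.40, 0.10, 0.30, 0.15])  # made-up class scores
top_n = 3

# Ascending sort, keep the last top_n positions, then reverse to descending.
top_indices = np.argsort(predictions)[-top_n:][::-1]

print(top_indices.tolist())  # [1, 3, 4]
```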


In [None]:
model.save("JOB Recommendartion model.keras")


In [None]:
# Track Changes in Job Market Dynamics Over Months
df_monthly_trends = df_cleaned.groupby(pd.Grouper(key="published_date", freq="ME")).size().reset_index(name="job_postings")
fig5 = px.line(df_monthly_trends, x="published_date", y="job_postings", title="Monthly Job Market Trends",
               labels={"published_date": "Month", "job_postings": "Number of Job Postings"})
fig5.show()


# Insights from the Monthly Job Market Trends Graph

1. Steady Growth in Job Postings

* The number of job postings has significantly increased over time.

* Initially, from December 2023 to late January 2024, job postings were minimal or absent.

* From February 2024 onwards, there is a sharp and continuous increase in job postings.

2. Possible Reasons for the Trend

* Seasonal Hiring: The increase from February to March might be due to companies opening new positions after the holiday season.

* Market Recovery: A growing job market could indicate economic recovery or increased demand in specific industries.

* Hiring Cycles: Many companies plan new hiring budgets at the start of the year, leading to a surge in job postings.

3. Future Predictions

* If the trend continues, March and April may see even higher job postings.

* The job market is currently in an expansion phase, which is beneficial for job seekers.

In [None]:
# Investigate Trends in the Remote Work Landscape
# Note: this counts all postings in df_cleaned per month and country; it does not filter on a remote flag or keyword
df_remote_trends = df_cleaned.groupby([pd.Grouper(key="published_date", freq="ME"), "country"]).size().reset_index(name="count")
fig6 = px.line(df_remote_trends, x="published_date", y="count", color="country", title="Trends in Remote Work Over Time",
               labels={"published_date": "Month", "count": "Number of Remote Job Postings"})
fig6.show()


# Insights from the Trends in Remote Work Over Time Graph

1. Increase in Remote Job Postings

* Initially, from December 2023 to late January 2024, there were very few remote job postings.

* Starting in February 2024, remote job postings increased significantly.

* The United States (blue line) has the highest number of remote job postings, followed by a few other countries like the Netherlands, Israel, and the United Kingdom.

* The chart is built from all postings in df_cleaned grouped by country, without a filter on remote roles, so these counts reflect overall posting volume per country.

2. Country-Specific Trends

* The United States dominates remote job postings, showing a sharp and continuous increase.

* Other countries, such as the Netherlands, Israel, and the United Kingdom, show moderate growth in remote jobs.

* Several other countries show small but consistent increases in remote job postings.

3. Possible Reasons for the Trends

* Post-Pandemic Job Market: Companies continue to offer remote work options due to changing work culture.

* Tech and IT Sector Growth: Many remote jobs are tech-related, and the US dominates the IT sector.

* Global Hiring Trends: Companies may be hiring remote workers from different countries to tap into global talent.

4. Future Predictions

* Remote job opportunities may continue to grow, especially in tech and service-based industries.

* More companies may adopt hybrid or fully remote work models to attract a global workforce.
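
To restrict a per-country trend to remote roles, one option is to keep only postings whose extracted keywords include "remote" before grouping. This is a sketch on made-up rows that assumes a `keywords` column holding lists, as produced by `extract_keywords`; it groups by calendar month with `dt.to_period("M")` for brevity.

```python
import pandas as pd

# Made-up rows in the shape of df_cleaned.
toy = pd.DataFrame({
    "country": ["United States", "United States", "India", "India"],
    "published_date": pd.to_datetime(
        ["2024-02-01", "2024-02-03", "2024-02-05", "2024-02-09"]
    ),
    "keywords": [["remote", "developer"], ["engineer"], ["remote"], ["data", "analyst"]],
})

# Boolean mask: does the posting's keyword list contain "remote"?
is_remote = toy["keywords"].apply(lambda kws: "remote" in kws)

# Count only the remote postings per month and country.
remote_counts = (
    toy[is_remote]
       .assign(month=lambda d: d["published_date"].dt.to_period("M"))
       .groupby(["month", "country"]).size()
       .reset_index(name="count")
)
print(remote_counts)
```

Only two of the four toy postings survive the filter, one per country.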

In [None]:
# Predict Future Job Market Trends
# Check if enough data points are available for seasonal estimation
if len(df_monthly_trends) < 2 * 12:  # 2 full seasonal cycles
    # If not enough data, either gather more data or disable seasonal component
    future_model = ExponentialSmoothing(df_monthly_trends["job_postings"], trend="add", seasonal=None).fit()  # Disable seasonal component
    print("Warning: Not enough data for seasonal estimation. Seasonal component disabled.")
else:
    future_model = ExponentialSmoothing(df_monthly_trends["job_postings"], trend="add", seasonal="add", seasonal_periods=12).fit()

future_predictions = future_model.forecast(12)
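# Generate 13 month-ends from the last observed month and drop the first, so forecast dates start one month later
fig7 = px.line(x=list(df_monthly_trends["published_date"]) + list(pd.date_range(start=df_monthly_trends["published_date"].max(), periods=13, freq="ME")[1:]),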
               y=list(df_monthly_trends["job_postings"]) + list(future_predictions),
               title="Predicted Future Job Market Trends",
               labels={"x": "Time (Months)", "y": "Job Postings"})
fig7.show()




Optimization failed to converge. Check mle_retvals.
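
When building the x-axis for a forecast, a `pd.date_range` that starts at the last observed month-end includes that month again, so the first forecast point would share a date with the last observation. A minimal illustration with a hypothetical last date; `pd.offsets.MonthEnd()` is used as the frequency so the sketch does not depend on a particular alias string.

```python
import pandas as pd

last_observed = pd.Timestamp("2024-03-31")  # hypothetical last observed month-end
month_end = pd.offsets.MonthEnd()

# Starting the range at the last observed month repeats that month.
overlapping = pd.date_range(start=last_observed, periods=3, freq=month_end)

# Generating one extra period and dropping the first starts at the next month.
future = pd.date_range(start=last_observed, periods=4, freq=month_end)[1:]

print(overlapping[0].date(), future[0].date())  # 2024-03-31 2024-04-30
```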



# Analysis of Predicted Future Job Market Trends

1. Overall Trend

The graph represents predicted job postings over time from January 2024 to early 2025.
Initially, job postings were very low, but around March 2024, there is a sharp increase.
After March 2024, the number of job postings steadily rises in a linear trend, reaching over 140k job postings by early 2025.

2. Key Observations

✅ Sudden Growth in Early 2024:

The job market appears to recover or expand rapidly during the first few months of 2024.
Possible reasons: economic recovery, seasonal hiring trends, or increased demand for jobs post-pandemic.

✅ Stable Growth After Initial Surge:

After the initial rise, job postings grow steadily at a consistent rate.

This suggests a healthy job market expansion with no sudden drops or slowdowns.

3. Possible Causes for the Growth

📌 Economic Recovery & Business Expansion → Companies expanding hiring post-recession/pandemic.

📌 Increase in Remote & Tech Jobs → More businesses adopting flexible/remote work options.

📌 Government Policies & Investments → New policies stimulating job creation.

📌 AI & Automation Impact → Companies may create new jobs while automating others.

In [None]:
!pip install streamlit pyngrok



In [None]:
!streamlit run aap.py & npx localtunnel --port 8501


Usage: streamlit run [OPTIONS] TARGET [ARGS]...
Try 'streamlit run --help' for help.

Error: Invalid value: File does not exist: aap.py
your url is: https://stupid-donuts-sit.loca.lt
^C


In [None]:
from pyngrok import ngrok

# Start Streamlit
!streamlit run aap.py &

# Create a public URL
public_url = ngrok.connect(port="8501")
print(f"Streamlit is live at: {public_url}")



Usage: streamlit run [OPTIONS] TARGET [ARGS]...
Try 'streamlit run --help' for help.

Error: Invalid value: File does not exist: aap.py


ERROR:pyngrok.process.ngrok:t=2025-02-26T10:34:53+0000 lvl=eror msg="failed to reconnect session" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-02-26T10:34:53+0000 lvl=eror msg="session closing" obj=tunnels.session err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n"
ERROR:pyngrok.process.ngrok:t=2025-02-26T10:34:53+0000 lvl=eror msg="terminating with error" obj=app err="authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your aut

PyngrokNgrokError: The ngrok process errored on start: authentication failed: Usage of ngrok requires a verified account and authtoken.\n\nSign up for an account: https://dashboard.ngrok.com/signup\nInstall your authtoken: https://dashboard.ngrok.com/get-started/your-authtoken\r\n\r\nERR_NGROK_4018\r\n.