**<center> <span style="color:#a10303;font-family:serif; font-size:34px;"> Client Retention Prediction Based on Client Behavior
📈</span> </center>**

---

🔍 **Objective:**  
Leverage AI to predict whether a client is likely to continue working with a freelancer based on project data such as rating, price, review count, and country.

📊 **Dataset Source:**  
[Freelancer Dataset on Kaggle](https://www.kaggle.com/datasets/isaacoresanya/freelancer)

🧠 **Approach:**  
- Data preprocessing & feature engineering  
- Natural Language Processing on `job_title`, `description`, and `tags`  
- Categorical encoding and **currency normalization using real-time exchange rates from the Fixer.io API**  
- Model training with Random Forest Classifier  
- Cross-validation to ensure performance stability  

📈 **Key Metric:**  
F1-score used to balance precision and recall in predicting client retention.

✅ **Results:**  
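Achieved a mean F1-score of **99.75%** across 5-fold cross-validation, with per-fold scores between 0.995 and 0.999. Because the retention label is defined as `client_average_rating >= 4` and the rating is also an input feature, this score largely measures how well the model recovers that threshold.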

---


In [1]:
import ast
import pandas as pd
import plotly.express as pt
from collections import Counter
from scipy.sparse import hstack, csr_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

In [2]:
df=pd.read_csv(r'D:\AI\MastryHub\freelancer_job_postings.csv')
df.head(3)

| | projectId | job_title | job_description | tags | client_state | client_country | client_average_rating | client_review_count | min_price | max_price | avg_price | currency | rate_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 37426471 | development and implementation of a federated ... | please bid only if you are ready to do the wor... | ['algorithm', 'java', 'python', 'machine learn... | Heilbronn | Germany | 5.0 | 17 | 8.0 | 30.0 | 19.0 | EUR | fixed |
| 1 | 37421546 | Data Entry -- 2 | Project Title: Data Entry - Data Analysis in E... | ['excel', 'statistical analysis', 'statistics'... | Nagpur | India | 0.0 | 0 | 750.0 | 1250.0 | 1000.0 | INR | hourly |
| 2 | 37400492 | Data Scrap | I am looking for a freelancer who can help me ... | ['web scraping', 'data mining', 'data entry', ... | Eaubonne | France | 5.0 | 1 | 30.0 | 250.0 | 140.0 | EUR | fixed |


# Data Preprocessing

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9193 entries, 0 to 9192
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   projectId              9193 non-null   int64  
 1   job_title              9193 non-null   object 
 2   job_description        9193 non-null   object 
 3   tags                   9193 non-null   object 
 4   client_state           8828 non-null   object 
 5   client_country         9192 non-null   object 
 6   client_average_rating  9193 non-null   float64
 7   client_review_count    9193 non-null   int64  
 8   min_price              9193 non-null   float64
 9   max_price              9193 non-null   float64
 10  avg_price              9193 non-null   float64
 11  currency               9193 non-null   object 
 12  rate_type              9193 non-null   object 
dtypes: float64(4), int64(2), object(7)
memory usage: 933.8+ KB


In [4]:
df.describe(include='O')

| | job_title | job_description | tags | client_state | client_country | currency | rate_type |
|---|---|---|---|---|---|---|---|
| count | 9193 | 9193 | 9193 | 8828 | 9192 | 9193 | 9193 |
| unique | 8572 | 8638 | 7258 | 3092 | 177 | 9 | 2 |
| top | Data Analysis | I need the statistical analysis expert.. for ... | ['python'] | Riyadh | India | USD | fixed |
| freq | 47 | 13 | 40 | 177 | 2743 | 4444 | 7322 |


In [5]:
df.isna().sum()

projectId                  0
job_title                  0
job_description            0
tags                       0
client_state             365
client_country             1
client_average_rating      0
client_review_count        0
min_price                  0
max_price                  0
avg_price                  0
currency                   0
rate_type                  0
dtype: int64

## Preprocessing Summary

- Filled missing `client_country` and `client_state` values with the column mode; missing prices default to 0.
- Normalized all prices to USD using the latest rates from the **Fixer.io API**.
- Label-encoded `client_country` and `client_state`; one-hot encoded `currency` and `rate_type`.
- Applied TF-IDF to `job_title` and `job_description`, and `MultiLabelBinarizer` to `tags`.

---

**Fill missing with most frequent value (mode)**


In [6]:
df['client_country'] = df['client_country'].fillna(df['client_country'].mode()[0])
df['client_state'] = df['client_state'].fillna(df['client_state'].mode()[0])

**Currency Normalization via Fixer.io API**

The latest exchange rates were fetched from the [Fixer.io](https://fixer.io/) API so that prices are comparable across projects. The API reports rates relative to EUR, so each rate is rebased to USD before `min_price`, `max_price`, and `avg_price` are converted.
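
To make the rebasing step concrete, here is a minimal sketch of the arithmetic. The helper name `rebase_to_usd` and the rates are hypothetical, chosen only so the numbers are easy to follow; they are not values returned by the API.

```python
def rebase_to_usd(eur_rates):
    """Turn EUR-based rates (units of currency per 1 EUR) into USD per unit of currency."""
    usd_per_eur = eur_rates["USD"]
    # Skip zero rates to avoid division by zero
    return {cur: usd_per_eur / rate for cur, rate in eur_rates.items() if rate}


# Made-up EUR-based rates, for illustration only
example_rates = {"EUR": 1.0, "USD": 1.10, "INR": 88.0}
to_usd = rebase_to_usd(example_rates)

# A 1000 INR budget expressed in USD under these made-up rates
price_usd = 1000 * to_usd["INR"]
print(to_usd)
print(round(price_usd, 2))
```

Under these assumed rates, one euro is worth 1.10 USD and USD maps to exactly 1.0, which is a quick sanity check that the rebasing is oriented the right way.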


In [7]:
import requests
def fetch_currency_rates(api_key):                                       
    url = f"http://data.fixer.io/api/latest?access_key={api_key}&symbols=USD,GBP,INR,NZD,EUR,AUD,CAD,HKD,SGD"
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        data = response.json()
        if not data.get("success", False):
            print("Error from API:", data.get("error", "Unknown error"))
            return None
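        # Fixer reports rates relative to EUR (units of each currency per 1 EUR).
        # Dividing the USD rate by a currency's rate gives USD per unit of that currency.
        eur_to_currency = data["rates"]
        currency_to_usd = {
            cur: eur_to_currency['USD'] / rate if rate != 0 else 1.0
            for cur, rate in eur_to_currency.items()
        }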
        print("Currency rates fetched successfully!")
        return currency_to_usd
    except requests.exceptions.Timeout:
        print("Request timed out.")
    except requests.exceptions.HTTPError as http_err:
        print(f" HTTP error occurred: {http_err}")
    except requests.exceptions.RequestException as req_err:
        print(f" Request error: {req_err}")
    except Exception as e:
        print(f" Unexpected error: {e}")

    return None

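import os

# Read the API key from the environment instead of hardcoding it in the notebook
# (FIXER_API_KEY is an assumed variable name)
api_key = os.environ.get('FIXER_API_KEY', '')
currency_to_usd = fetch_currency_rates(api_key)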

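def convert_to_usd(row):
    """Convert one row's prices to USD using the fetched rates."""
    currency = row['currency']
    # Falls back to a rate of 1.0 (price left unchanged) if the fetch failed or the currency is unknown
    rate = currency_to_usd.get(currency, 1.0) if currency_to_usd else 1.0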
    return pd.Series({
        'min_price_usd': row['min_price'] * rate,
        'max_price_usd': row['max_price'] * rate,
        'avg_price_usd': row['avg_price'] * rate
    })

df[['min_price', 'max_price', 'avg_price']] = df[['min_price', 'max_price', 'avg_price']].fillna(0)
converted_prices = df.apply(convert_to_usd, axis=1)
df = pd.concat([df, converted_prices], axis=1)
df.drop(['min_price', 'max_price', 'avg_price'], axis=1, inplace=True)
print("Currency conversion complete. DataFrame is ready!")

Currency rates fetched successfully!
Currency conversion complete. DataFrame is ready!


In [8]:
df.describe()

| | projectId | client_average_rating | client_review_count | min_price_usd | max_price_usd | avg_price_usd |
|---|---|---|---|---|---|---|
| count | 9193.0 | 9193.0 | 9193.0 | 9193.0 | 9193.0 | 9193.0 |
| mean | 35328580.0 | 2.407593 | 20.823126 | 179.252286 | 436.204415 | 307.728351 |
| std | 1973329.0 | 2.472512 | 87.454084 | 1738.837498 | 2631.836532 | 2153.440823 |
| min | 31083240.0 | 0.0 | 0.0 | 1.165391 | 1.177045 | 1.171218 |
| 25% | 33565860.0 | 0.0 | 0.0 | 9.355848 | 25.0 | 20.0 |
| 50% | 36326560.0 | 0.0 | 0.0 | 17.480869 | 145.673904 | 81.577386 |
| 75% | 37024160.0 | 5.0 | 5.0 | 30.0 | 250.0 | 140.0 |
| max | 37439380.0 | 5.0 | 969.0 | 145000.0 | 170000.0 | 157500.0 |


# Data Visualization

📊 **Distribution of Client Ratings**

In [9]:
rating_counts = df['client_average_rating'].value_counts(normalize=True).sort_index()
rating_percent = (rating_counts * 100).reset_index()
rating_percent.columns = ['rating', 'percentage']
fig = pt.bar(rating_percent, x='rating', y='percentage',
             color_discrete_sequence=['lightcoral'],
             labels={'rating': 'Client Rating', 'percentage': 'Percentage (%)'},
             title='Client Average Rating Distribution (as Percentage)')
fig.update_layout(yaxis_tickformat='.1f',  
                  xaxis_title='Rating',
                  yaxis_title='Percentage (%)')
fig.show()

🌍 **Top Countries by Number of Clients**

In [10]:
top_countries = df['client_country'].value_counts().nlargest(10).reset_index()
top_countries.columns = ['country', 'count']
fig = pt.bar(top_countries, x='country', y='count', color='count', color_continuous_scale='Blues')
fig.update_layout(title='Top 10 Client Countries', xaxis_title='Country', yaxis_title='Number of Clients')
fig.show()

In [11]:
fig = pt.scatter(df, x='client_review_count', y='avg_price_usd', color='client_country',
                 size='client_average_rating', hover_data=['rate_type', 'currency'],
                 title='Avg Price vs Client Review Count by Country')
fig.update_layout(xaxis_title='Client Review Count', yaxis_title='Average Price')
fig.show()

**Tags Word Frequency**

In [12]:
tags_series = df['tags'].dropna()
all_tags = []
for tag_list in tags_series:
    try:
        parsed = ast.literal_eval(tag_list)
        cleaned = [tag.strip().lower() for tag in parsed if isinstance(tag, str)]
        all_tags.extend(cleaned)
    except (ValueError, SyntaxError):
        # Skip entries that are not valid Python list literals
        continue

tag_counts = Counter(all_tags)
tag_df = pd.DataFrame(tag_counts.items(), columns=['tag', 'count'])
tag_df = tag_df.sort_values('count', ascending=False).head(20)
fig = pt.bar(tag_df, x='tag', y='count', title='Top 20 Most Frequent Tags',
             color='count', color_continuous_scale='tealgrn')
fig.update_layout(xaxis_title='Tag', yaxis_title='Count', xaxis_tickangle=-45)
fig.show()

**Encoding**

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9193 entries, 0 to 9192
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   projectId              9193 non-null   int64  
 1   job_title              9193 non-null   object 
 2   job_description        9193 non-null   object 
 3   tags                   9193 non-null   object 
 4   client_state           9193 non-null   object 
 5   client_country         9193 non-null   object 
 6   client_average_rating  9193 non-null   float64
 7   client_review_count    9193 non-null   int64  
 8   currency               9193 non-null   object 
 9   rate_type              9193 non-null   object 
 10  min_price_usd          9193 non-null   float64
 11  max_price_usd          9193 non-null   float64
 12  avg_price_usd          9193 non-null   float64
dtypes: float64(4), int64(2), object(7)
memory usage: 933.8+ KB


In [14]:
cat_cols = df.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
    print(f"{col}: {df[col].nunique()}")

job_title: 8572
job_description: 8638
tags: 7258
client_state: 3092
client_country: 177
currency: 9
rate_type: 2


In [15]:
# Label Encoding
le_country = LabelEncoder()
df['client_country_encoded'] = le_country.fit_transform(df['client_country'])
le_state = LabelEncoder()
df['client_state_encoded'] = le_state.fit_transform(df['client_state'])
# One-Hot Encoding
df = pd.get_dummies(df, columns=['currency', 'rate_type'])
df.drop(['projectId','client_country', 'client_state','currency', 'rate_type'], axis=1,errors='ignore', inplace=True)


In [16]:
tfidf_title = TfidfVectorizer(max_features=100, stop_words='english')
X_title = tfidf_title.fit_transform(df['job_title'].fillna(""))
tfidf_desc = TfidfVectorizer(max_features=200, stop_words='english')
X_desc = tfidf_desc.fit_transform(df['job_description'].fillna(""))

In [17]:
def process_tags(x):
    try:
        tags = ast.literal_eval(x)
        return [t.strip().lower() for t in tags if isinstance(t, str)]
    except (ValueError, SyntaxError):
        # Treat unparsable tag strings as having no tags
        return []

df['tags_cleaned'] = df['tags'].apply(process_tags)
mlb = MultiLabelBinarizer(sparse_output=True)
X_tags = mlb.fit_transform(df['tags_cleaned'])
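
For readers unfamiliar with `MultiLabelBinarizer`, here is a small sketch on made-up tag lists showing the one-column-per-tag encoding it produces. It uses the dense default rather than `sparse_output=True` so the result prints readably.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag lists; the last project has no tags
tag_lists = [["python", "excel"], ["python"], []]

mlb_demo = MultiLabelBinarizer()
encoded = mlb_demo.fit_transform(tag_lists)

print(mlb_demo.classes_)  # column order: tags sorted alphabetically
print(encoded)            # 1 where the project carries that tag
```

Each distinct tag becomes its own column, so the number of columns in `X_tags` equals the number of unique cleaned tags in the dataset.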

In [18]:
num_cols = [
    'client_average_rating', 'client_review_count',
    'min_price_usd', 'max_price_usd', 'avg_price_usd',
    'client_country_encoded', 'client_state_encoded',
    'currency_AUD', 'currency_CAD', 'currency_EUR',
    'currency_GBP', 'currency_HKD', 'currency_INR', 'currency_NZD',
    'currency_SGD', 'currency_USD',
    'rate_type_fixed', 'rate_type_hourly'
]
X_numeric = csr_matrix(df[num_cols].astype(float).values)
X_all = hstack([X_numeric, X_title, X_desc, X_tags])
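
The stacking step can be checked on a tiny example. This sketch uses two made-up matrices standing in for the numeric and text blocks and confirms that `hstack` concatenates columns while keeping rows aligned.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two rows of made-up numeric features (e.g. rating, review count)
numeric_part = csr_matrix(np.array([[5.0, 17.0], [0.0, 0.0]]))
# Two rows of made-up TF-IDF weights over three terms
text_part = csr_matrix(np.array([[0.0, 0.7, 0.0], [0.3, 0.0, 0.0]]))

# hstack requires the same number of rows in every block
combined = hstack([numeric_part, text_part]).tocsr()
print(combined.shape)
```

Converting the result to CSR is optional here; it simply makes row slicing cheap, which is what train/test splitting relies on.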

In [19]:
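# Proxy label: the dataset has no explicit retention column, so clients rated >= 4 are treated as retained.
# Note that client_average_rating is also part of num_cols, so the model sees the value the label is derived from.
df['client_retained'] = df['client_average_rating'].apply(lambda x: 1 if x >= 4 else 0)
y = df['client_retained'].values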
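
Since the label is a threshold on `client_average_rating` and that column is also a feature, it is worth seeing how easily a tree-based model can recover such a label. The sketch below uses purely synthetic data (made-up ratings plus a noise column) and a single `DecisionTreeClassifier` rather than the notebook's random forest; it is an illustration of the effect, not a re-run of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic ratings 0.0, 0.5, ..., 5.0, each repeated 50 times
ratings = np.tile(np.arange(0, 5.5, 0.5), 50)
noise = rng.normal(size=ratings.shape[0])  # an uninformative extra feature

X_demo = np.column_stack([ratings, noise])
y_demo = (ratings >= 4).astype(int)  # same thresholding rule as the notebook's label

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
acc = tree.score(X_te, y_te)
print(acc)
```

One way to get a more informative estimate would be to drop `client_average_rating` from the feature list before training, or to define retention from a signal that is not among the inputs.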

# Model

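## Model & Training

- Model: `RandomForestClassifier` (100 trees, `random_state=42`)
- Features: numeric, encoded categorical, and text-based columns combined into one sparse matrix.
- Evaluation: an 80/20 train/test split, followed by 5-fold cross-validation on the full dataset.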

---

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy*100)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

Accuracy: 99.72811310494835
[[944   5]
 [  0 890]]
              precision    recall  f1-score   support

           0       1.00      0.99      1.00       949
           1       0.99      1.00      1.00       890

    accuracy                           1.00      1839
   macro avg       1.00      1.00      1.00      1839
weighted avg       1.00      1.00      1.00      1839
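
The reported numbers can be reproduced by hand from the confusion matrix printed above. This sketch recomputes precision, recall, F1, and accuracy for the positive class (retained = 1), using scikit-learn's convention that rows are true labels and columns are predictions.

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[944, 5],
               [0, 890]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted-retained clients that were truly retained
recall = tp / (tp + fn)     # share of truly retained clients that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}")
```

The five misclassified rows are all false positives, which is why recall for class 1 is perfect while precision is slightly below 1.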



# Model Evaluation

To evaluate the model's performance, we used **5-fold cross-validation** with the F1 score as the metric. This approach ensures a more robust estimate of model performance across different subsets of the data.


In [21]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X_all, y, cv=5, scoring='f1')
print(scores)
print("Average F1:", scores.mean())

[0.99943662 0.99494666 0.99887387 0.99831176 0.99606521]
Average F1: 0.997526823867895
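
When `cv` is an integer and the estimator is a classifier, `cross_val_score` uses stratified folds without shuffling. If row order could matter, an explicit `StratifiedKFold` with shuffling is one option. The sketch below shows that variant on synthetic data from `make_classification` (not the notebook's features) and reports the spread alongside the mean.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

scores_demo = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="f1")
print(f"F1: {scores_demo.mean():.3f} +/- {scores_demo.std():.3f}")
```

Reporting the standard deviation next to the mean makes it easier to judge whether differences between folds are meaningful.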


## 📈 Evaluation & Results

- **F1-Score:** `0.9975` average across folds ✅


In [22]:
import joblib
import os

os.makedirs("artifacts", exist_ok=True)
joblib.dump(tfidf_title, "artifacts/tfidf_title.pkl")
joblib.dump(tfidf_desc, "artifacts/tfidf_desc.pkl")
joblib.dump(le_country, "artifacts/le_country.pkl")
joblib.dump(le_state, "artifacts/le_state.pkl")
joblib.dump(mlb, "artifacts/mlb_tags.pkl")
joblib.dump(model, "artifacts/model.pkl")


['artifacts/model.pkl']
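
The saved artifacts are meant to be reloaded at inference time. As a minimal sketch of that round trip, the example below saves and reloads a small vectorizer in a temporary directory (the file name `tfidf_demo.pkl` and the training texts are made up) and checks that the restored object transforms new text identically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit a small vectorizer on made-up titles
vec = TfidfVectorizer(stop_words="english").fit(
    ["python data analysis", "excel data entry"]
)

# Save and reload inside a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tfidf_demo.pkl")
    joblib.dump(vec, path)
    restored = joblib.load(path)

new_text = ["data entry in python"]
same = np.array_equal(
    vec.transform(new_text).toarray(),
    restored.transform(new_text).toarray(),
)
print(same)
```

Pickled scikit-learn objects are generally only safe to load with the same library version that created them, so recording the version alongside the artifacts is a sensible habit.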

In [23]:
import sklearn
print(sklearn.__version__)

1.4.1.post1


In [24]:
#! pip install --upgrade scikit-learn==1.4.1.post1


In [25]:
# The feature matrices built earlier are named X_title, X_desc, and X_tags
print("Title TF-IDF shape:", X_title.shape)
print("Desc TF-IDF shape:", X_desc.shape)
print("Tags shape:", X_tags.shape)