# **Recommendation System**

##### Leveraging machine learning techniques to provide personalized and relevant course recommendations to users, with the aim of enhancing engagement and improving learning outcomes.


#### Intel optimizations 
`patch_sklearn()` enables the Intel Extension for Scikit-learn. It should be called before any scikit-learn modules are imported.

In [2]:
# Install dependencies and enable the Intel optimizations

!pip install langdetect
!pip install "modin[all]"
!pip install scikit-learn-intelex

# Patch scikit-learn before importing any sklearn modules
from sklearnex import patch_sklearn
patch_sklearn()

from google.colab import drive
drive.mount('/content/drive')



Intel(R) Extension for Scikit-learn* enabled (https://github.com/intel/scikit-learn-intelex)


Mounted at /content/drive


### Importing necessary libraries for data processing and analysis

#### Intel optimization
`import modin.pandas as pd`: using Modin, a drop-in replacement for pandas that is also available as an Intel-optimized distribution.

In [16]:
import numpy as np  # linear algebra
from statistics import harmonic_mean
from langdetect import detect
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer, CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import f1_score
from sklearn.metrics.pairwise import cosine_similarity
import modin.pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)


In [17]:
# Read the CSV file into a Modin DataFrame
df = pd.read_csv("coursea_data.csv")

In [18]:
# Dropping unnecessary columns from the DataFrame
df.drop(['Unnamed: 0', 'course_organization'], axis=1, inplace=True)


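Analyzing the distribution of enrollment numbers can provide valuable insights into user preferences and trends, and can help identify popular courses or enrollment patterns. The enrollment counts are stored as strings with a unit suffix (`k` or `m`), so we first check how the suffixes are distributed.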

In [19]:
# Count how often each last character (the unit suffix) occurs in the 'course_students_enrolled' column

df.course_students_enrolled.apply(lambda count : count[-1]).value_counts()

k    887
m      4
Name: course_students_enrolled, dtype: int64

In [20]:
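# Keep only the courses whose enrollment is given in thousands ('k'); the 4 rows ending in 'm' are dropped
df = df[df.course_students_enrolled.str.endswith('k')]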

In [11]:
# Strip the 'k' suffix and convert to a number of students (float() avoids calling eval on data)
df['course_students_enrolled'] = df['course_students_enrolled'].apply(lambda enrolled : float(enrolled[:-1]) * 1000)
df.head()

Unnamed: 0,course_title,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
0,(ISC)² Systems Security Certified Practitioner...,SPECIALIZATION,4.7,Beginner,5300.0
1,A Crash Course in Causality: Inferring Causal...,COURSE,4.7,Intermediate,17000.0
2,A Crash Course in Data Science,COURSE,4.5,Mixed,130000.0
3,A Law Student's Toolkit,COURSE,4.7,Mixed,91000.0
4,A Life of Happiness and Fulfillment,COURSE,4.8,Mixed,320000.0
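
The filter above discards the four courses whose enrollment ends in `m`. As a minimal sketch of an alternative, the helper below converts both suffixes to plain numbers instead of dropping rows. The name `parse_enrollment` and the suffix table are illustrative assumptions and are not part of the original notebook.

```python
# Hypothetical helper (not in the original notebook): convert enrollment
# strings such as "5.3k" or "1.5m" into plain numbers.
SUFFIX_MULTIPLIERS = {"k": 1_000, "m": 1_000_000}  # assumed meaning of the suffixes


def parse_enrollment(text: str) -> float:
    """Return the numeric enrollment encoded by a non-empty string like '17k'."""
    text = text.strip().lower()
    suffix = text[-1]
    if suffix in SUFFIX_MULTIPLIERS:
        return float(text[:-1]) * SUFFIX_MULTIPLIERS[suffix]
    return float(text)  # no suffix: already a plain number


print(parse_enrollment("17k"))
print(parse_enrollment("1.5m"))
```

If this approach were adopted, `df['course_students_enrolled'].apply(parse_enrollment)` could stand in for the filter and conversion cells, at the cost of keeping a few very large values that would dominate the later scaling step.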


### The **MinMaxScaler** is a popular data preprocessing technique that scales numerical features to a specified range, typically between 0 and 1.

It computes the minimum and maximum of each column and rescales the values to the range [0, 1], so that the course rating and the enrollment count are on a comparable scale and both contribute to the recommendation.

In [None]:
minmax_scaler = MinMaxScaler()
scaled_ratings = minmax_scaler.fit_transform(df[['course_rating','course_students_enrolled']])
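
As a sketch of what the scaler computes for a single column, min-max scaling maps each value x to (x - min) / (max - min). The sample ratings below are made up, and the bounds 3.3 and 5.0 are an assumption: the notebook does not state them, but they are consistent with the scaled ratings in its output (a rating of 4.7 appears as roughly 0.8235).

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero and map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Assumed sample of ratings; 3.3 and 5.0 play the role of column min and max.
ratings = [3.3, 4.5, 4.7, 4.8, 5.0]
scaled = min_max_scale(ratings)
print([round(v, 6) for v in scaled])
```

The same function applied to the enrollment column would put both features on the [0, 1] scale, which is what allows them to be combined in the next step.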

### The **harmonic mean** is used to calculate an overall rating for each course by combining the course rating and the number of students enrolled. It provides a way to consider both factors simultaneously and derive a single measure that represents the combined information in a meaningful manner.

In [None]:
# Calculate an overall rating as the harmonic mean of the scaled course rating and the scaled number of students enrolled.
df['course_rating'] = scaled_ratings[:,0]
df['course_students_enrolled'] = scaled_ratings[:,1]
df['overall_rating'] = df[['course_rating','course_students_enrolled']].apply(lambda row : harmonic_mean(row), axis=1)

In [None]:
df.head()

Unnamed: 0,course_title,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled,overall_rating
0,(ISC)² Systems Security Certified Practitioner...,SPECIALIZATION,0.823529,Beginner,0.004587,0.009122
1,A Crash Course in Causality: Inferring Causal...,COURSE,0.823529,Intermediate,0.018709,0.036586
2,A Crash Course in Data Science,COURSE,0.705882,Mixed,0.1551,0.254319
3,A Law Student's Toolkit,COURSE,0.823529,Mixed,0.108027,0.190999
4,A Life of Happiness and Fulfillment,COURSE,0.882353,Mixed,0.38443,0.535534
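
To see how the harmonic mean pulls the overall rating toward the smaller of the two scaled values, the sketch below recomputes it for the last row shown above using Python's `statistics.harmonic_mean`. For two values a and b it equals 2ab / (a + b). The second call illustrates a side effect worth keeping in mind: a zero in either column, which min-max scaling produces for the column minimum, gives an overall rating of zero.

```python
from statistics import harmonic_mean

# Scaled values copied from the row for "A Life of Happiness and Fulfillment".
rating, enrolled = 0.882353, 0.38443

overall = harmonic_mean([rating, enrolled])
manual = 2 * rating * enrolled / (rating + enrolled)  # closed form for two values
print(round(overall, 6), round(manual, 6))

# A course at the column minimum is scaled to 0, and any zero input
# makes the harmonic mean 0 regardless of the other value.
zero_case = harmonic_mean([0.9, 0.0])
print(zero_case)
```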


In [None]:
# Keep only the courses whose title is detected as English
df = df[df.course_title.apply(lambda title : detect(title) == 'en')]

**Term Frequency-Inverse Document Frequency (TF-IDF)**

**Implemented by the `TfidfVectorizer` class from the `sklearn.feature_extraction.text` module.**

It is a popular technique used in natural language processing to represent text data numerically.

By vectorizing the course titles using TF-IDF, we can transform the textual data into a numerical representation that machine learning algorithms can work with. This allows us to calculate similarity measures between courses, identify important keywords, and incorporate text-based features.

In [None]:
vectorizer = TfidfVectorizer(stop_words='english')
vectors = vectorizer.fit_transform(df.course_title)
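
As a rough illustration of what TF-IDF weighting does, the sketch below computes the weights by hand for three made-up titles. It uses the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization, which follows scikit-learn's documented defaults, but it splits on whitespace only and skips stop-word removal. Treat it as a simplification, not a re-implementation of `TfidfVectorizer`; the titles and the function name `tfidf` are illustrative.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, tf in counts.items()
        }
        # L2-normalize so every non-empty document vector has length 1.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors


titles = ["python basics", "python data structures", "statistics basics"]  # made-up examples
vecs = tfidf(titles)
print(vecs[1])
```

In the second title, "data" and "structures" each appear in only one title and so receive a higher weight than "python", which appears in two. This is the effect that lets rarer, more distinctive words drive the similarity scores.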

By computing the **cosine similarity** (the `cosine_similarity` function from the `sklearn.metrics.pairwise` module in scikit-learn), we obtain a similarity score for each course in the dataset compared to the input course title.

This allows us to identify courses that are most similar to the input course title, which is a crucial step in generating relevant recommendations.
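
The score itself is the dot product of two vectors divided by the product of their lengths. A minimal plain-Python sketch with toy vectors (not taken from the dataset) shows the range of values to expect; the function name `cosine` is illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


query = [1.0, 0.0, 0.0]            # toy vector: only the first term is present
same_direction = [2.0, 0.0, 0.0]   # same term, different magnitude
partial_overlap = [1.0, 1.0, 0.0]  # shares one of its two terms with the query
no_overlap = [0.0, 0.0, 3.0]       # shares no term with the query

for other in (same_direction, partial_overlap, no_overlap):
    print(round(cosine(query, other), 4))
```

Because TF-IDF weights are non-negative, scores fall between 0 (no shared terms) and 1 (same direction), and the magnitude of a vector does not affect the result.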


In [None]:
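# Recommend the courses whose titles are most similar to the given title

def recommend_by_course_title(title, recomm_count=10):
    # Project the query into the TF-IDF space fitted on the course titles
    title_vector = vectorizer.transform([title])
    # Similarity of every course title to the query, shape (n_courses, 1)
    cosine_sim = cosine_similarity(vectors, title_vector)
    # Positions of the recomm_count most similar courses
    idx = np.argsort(np.array(cosine_sim[:, 0]))[-recomm_count:]
    # Order the selected courses by overall rating, best first
    sdf = df.iloc[idx].sort_values(by='overall_rating', ascending=False)
    return sdf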
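
Note that `np.argsort(...)[-recomm_count:]` always selects `recomm_count` rows, so when fewer titles share a term with the query, courses with a similarity of zero fill the remaining slots. One possible refinement, sketched here in plain Python with a hypothetical helper `top_k_nonzero` and toy scores, keeps only positive similarities.

```python
def top_k_nonzero(scores, k=10):
    """Return indices of up to k highest scores, best first, skipping scores <= 0."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked[:k] if scores[i] > 0]


scores = [0.0, 0.62, 0.0, 0.31, 0.0]  # toy similarity scores, not from the dataset
print(top_k_nonzero(scores, k=3))
```

With these toy scores only two of the three requested slots are filled, which may be preferable to padding the result with unrelated courses.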

**Function call**

### Takes a course title (or any search term) as input and returns a DataFrame of the most similar courses, sorted by overall rating.

In [None]:
# Search for any term, e.g. 'python' or 'machine learning'
recommend_by_course_title('python')

Unnamed: 0,course_title,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled,overall_rating
684,Python Data Structures,COURSE,0.941176,Mixed,0.50513,0.657421
487,Introduction to Data Science in Python,COURSE,0.705882,Intermediate,0.46892,0.563503
687,Python for Data Science and AI,COURSE,0.764706,Beginner,0.20338,0.321305
570,Machine Learning with Python,COURSE,0.823529,Intermediate,0.14303,0.243729
682,Python Basics,COURSE,0.882353,Beginner,0.13096,0.228069
188,Data Analysis with Python,COURSE,0.823529,Beginner,0.13096,0.225983
391,Google IT Automation with Python,PROFESSIONAL CERTIFICATE,0.823529,Beginner,0.110441,0.194762
203,Data Visualization with Python,COURSE,0.764706,Intermediate,0.077852,0.141316
513,Introduction to Scripting in Python,SPECIALIZATION,0.823529,Beginner,0.057333,0.107202
752,Statistics with Python,SPECIALIZATION,0.764706,Beginner,0.039228,0.074627
