# IR model applied to Coursera dataset

### Importing essential libraries for data analysis and visualization
- **NumPy** for numerical operations in linear algebra
- **Pandas** for data manipulation and analysis, including CSV file input/output
- **Matplotlib** for basic data visualization
- **Seaborn** for statistical data visualization


In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # for data visualization purposes
import seaborn as sns # for statistical data visualization

%matplotlib inline

ds = pd.read_csv('CourseraDataset-Clean.csv')


### Retrieving the dimensions of the DataFrame: 8370 rows and 13 columns

In [None]:
ds.shape

In [None]:
ds.head()

The pandas method `ds.info()` summarizes the DataFrame's structure: column names, non-null counts, dtypes, and memory usage.

In [None]:
ds.info()

The code below aims to identify, count, and display the names of categorical variables in the DataFrame, along with a preview of the first few rows of those variables.

In [None]:
# find categorical variables

categorical = [var for var in ds.columns if ds[var].dtype=='O']

print('There are {} categorical variables\n'.format(len(categorical)))

print('The categorical variables are :\n\n', categorical)

# view the categorical variables

ds[categorical].head()

In [None]:
# check missing values in categorical variables

ds[categorical].isnull().sum()

### The code iterates through each categorical variable in the DataFrame ds and prints the frequency counts of unique values for each variable.
------------------------------------------------------------------------------------------------------------------------------

### The output provides a detailed breakdown of the frequency of different values within each categorical variable, offering insights into the distribution of course-related information.

In [None]:
# view frequency counts of values in categorical variables

for var in categorical: 
    
    print(ds[var].value_counts())

In [None]:
for var in categorical: 
    print(ds[var].value_counts() / float(len(ds)))
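Dividing the counts by `len(ds)` keeps rows with missing values in the denominator, which is not the same as `value_counts(normalize=True)`. The sketch below shows the difference on a small made-up column (the values are illustrative and not taken from the Coursera file):

```python
import pandas as pd

# Toy column with one missing value (hypothetical data, not from the dataset)
levels = pd.Series(['Beginner', 'Advanced', 'Beginner', None])

# Dividing counts by the full length keeps missing rows in the denominator
share_of_all_rows = levels.value_counts() / len(levels)

# normalize=True divides by the number of non-missing values instead
share_of_non_missing = levels.value_counts(normalize=True)

print(share_of_all_rows)
print(share_of_non_missing)
```

When a column has no missing values, the two results coincide.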

In [None]:
ds['Level'].value_counts()

In [None]:
# replace 'Not specified' values in the Level variable with `NaN`

ds['Level'] = ds['Level'].replace('Not specified', np.nan)

In [None]:
ds['Level'].value_counts()

The code **ds[categorical].isnull().sum()** calculates and prints the count of missing values for each categorical variable in the DataFrame ds

In [None]:
ds[categorical].isnull().sum()

The code **ds['Schedule'].value_counts()** calculates and prints the frequency count of unique values in the 'Schedule' column of the DataFrame ds

In [None]:
ds['Schedule'].value_counts()

*This output below provides information about the diversity of unique labels within each categorical variable. It shows the number of distinct categories for each variable, which can be useful for understanding the variety and granularity of the data in these categorical columns*

In [None]:
for var in categorical:
    
    print(var, ' contains ', len(ds[var].unique()), ' labels')

In [None]:
# find numerical variables

numerical = [var for var in ds.columns if ds[var].dtype!='O']

print('There are {} numerical variables\n'.format(len(numerical)))

print('The numerical variables are :', numerical)

In [None]:
ds[numerical].head()

In [None]:
ds[numerical].isnull().sum()

In [None]:
X = ds.drop(['Schedule', 'Course Url'], axis=1)

y = ds['Schedule']

In [None]:
# split X and y into training and testing sets

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
# check the shape of X_train and X_test

X_train.shape, X_test.shape

In [None]:
# check data types in X_train

X_train.dtypes

In [None]:
# display categorical variables

categorical = [col for col in X_train.columns if X_train[col].dtypes == 'O']

categorical

In [None]:
# display numerical variables

numerical = [col for col in X_train.columns if X_train[col].dtypes != 'O']

numerical

In [None]:
# print percentage of missing values in the categorical variables in training set

X_train[categorical].isnull().mean()

In [None]:
# print categorical variables with missing data

for col in categorical:
    if X_train[col].isnull().mean()>0:
        print(col, (X_train[col].isnull().mean()))

In [None]:
# impute missing categorical variables with the most frequent value from the training set

for ds2 in [X_train, X_test]:
    ds2['Level'] = ds2['Level'].fillna(X_train['Level'].mode()[0])
    ds2['Modules'] = ds2['Modules'].fillna(X_train['Modules'].mode()[0])
    ds2['Instructor'] = ds2['Instructor'].fillna(X_train['Instructor'].mode()[0])
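The imputation above learns the most frequent value from the training set only and reuses it for the test set. A compact way to express the same idea for several columns at once is sketched below, assuming small made-up frames (`train_toy` and `test_toy` are hypothetical names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for X_train and X_test
train_toy = pd.DataFrame({
    'Level': ['Beginner', 'Beginner', np.nan, 'Advanced'],
    'Instructor': ['A', np.nan, 'B', 'B'],
})
test_toy = pd.DataFrame({
    'Level': [np.nan, 'Intermediate'],
    'Instructor': ['A', np.nan],
})

# One mode per column, computed on the training data only
train_modes = train_toy.mode().iloc[0]

# fillna accepts a Series keyed by column name
train_filled = train_toy.fillna(train_modes)
test_filled = test_toy.fillna(train_modes)

print(train_filled)
print(test_filled)
```

Computing the modes once on the training frame keeps information from the test set out of the imputation step.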


In [None]:
# check missing values in categorical variables in X_train

X_train[categorical].isnull().sum()

In [None]:
# check missing values in categorical variables in X_test

X_test[categorical].isnull().sum()

In [None]:
# check missing values in X_train

X_train.isnull().sum()

In [None]:
# check missing values in X_test

X_test.isnull().sum()

In [None]:
# print categorical variables

categorical

*This output provides a view of the first five rows of the specified categorical columns in your dataset, showcasing details about different online courses, including their titles, difficulty levels, learning objectives, skills gained, course modules, instructors, offering institutions, and associated keywords.*

In [None]:
X_train[categorical].head()

In [None]:
# import category encoders

import category_encoders as ce

In [None]:
encoder = ce.OneHotEncoder(cols=['Course Title', 'Level', 'What you will learn', 'Skill gain', 'Modules', 
                                 'Instructor', 'Offered By', 'Keyword'])

X_train = encoder.fit_transform(X_train)

X_test = encoder.transform(X_test)

In [None]:
X_train.head()

In [None]:
X_train.shape

In [None]:
X_test.head()

In [None]:
X_test.shape

In [None]:
cols = X_train.columns

### The code below applies robust scaling to the training and test sets with RobustScaler. It centers each feature on its median and scales by its interquartile range, so the result is less sensitive to outliers than scaling by mean and variance. The scaler is fitted on the training set only and then applied to the test set.

In [None]:
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
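As a minimal sketch of what robust scaling computes, assuming a toy single-feature array with one outlier (not taken from the dataset), the scaled values can be reproduced by hand from the median and interquartile range:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Toy feature with one extreme value (hypothetical data)
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler with default settings
scaled = RobustScaler().fit_transform(x)

# Manual equivalent: subtract the median, divide by the interquartile range
median = np.median(x, axis=0)
q1, q3 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```

The outlier still maps to a large value, but it does not distort the center or spread used to scale the other rows.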

In [None]:
X_train = pd.DataFrame(X_train, columns=cols)

In [None]:
X_test = pd.DataFrame(X_test, columns=cols)

In [None]:
X_train.head()

In [None]:
# train a Gaussian Naive Bayes classifier on the training set
from sklearn.naive_bayes import GaussianNB


# instantiate the model
gnb = GaussianNB()


# fit the model
gnb.fit(X_train, y_train)

In [None]:
y_pred = gnb.predict(X_test)

y_pred

In [None]:
from sklearn.metrics import accuracy_score

print('Model accuracy score: {0:0.4f}'. format(accuracy_score(y_test, y_pred)))

In [None]:
y_pred_train = gnb.predict(X_train)

y_pred_train

In [None]:
print('Training-set accuracy score: {0:0.4f}'. format(accuracy_score(y_train, y_pred_train)))

In [None]:
# print the scores on training and test set

print('Training set score: {:.4f}'.format(gnb.score(X_train, y_train)))

print('Test set score: {:.4f}'.format(gnb.score(X_test, y_test)))

In [None]:
# check class distribution in test set

y_test.value_counts()


In [None]:
# check null accuracy score

null_accuracy = (2412/(2412+99))

print('Null accuracy score: {0:0.4f}'. format(null_accuracy))
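The null accuracy is the accuracy obtained by always predicting the most frequent class. Instead of typing the class counts by hand, it can be derived from the label Series. The sketch below uses a small made-up Series (`y_toy` and its label strings are illustrative assumptions):

```python
import pandas as pd

# Hypothetical labels: 8 of one class, 2 of the other
y_toy = pd.Series(['Flexible schedule'] * 8 + ['Hands-on learning'] * 2)

# Share of the most frequent class = accuracy of a constant majority-class predictor
null_accuracy_toy = y_toy.value_counts(normalize=True).max()

print('Null accuracy score: {0:0.4f}'.format(null_accuracy_toy))
```

Applied to the notebook, the same expression on `y_test` would avoid hard-coding the counts.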

In [None]:
# Print the Confusion Matrix and slice it into four pieces

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

print('Confusion matrix\n\n', cm)

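# rows of cm are actual classes and columns are predicted classes;
# the first label (row/column 0) is treated as the positive class
print('\nTrue Positives(TP) = ', cm[0,0])

print('\nTrue Negatives(TN) = ', cm[1,1])

print('\nFalse Positives(FP) = ', cm[1,0])

print('\nFalse Negatives(FN) = ', cm[0,1])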

In [None]:
# visualize confusion matrix with seaborn heatmap
# (scikit-learn puts actual classes on the rows and predicted classes on the columns)

cm_matrix = pd.DataFrame(data=cm, columns=['Predicted Positive:1', 'Predicted Negative:0'], 
                                 index=['Actual Positive:1', 'Actual Negative:0'])

sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='YlGnBu')
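To make the layout of scikit-learn's confusion matrix concrete, here is a small sketch with made-up labels (`'pos'` and `'neg'` are placeholders, not the dataset's classes). Passing `labels=` fixes the row and column order explicitly:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground truth and predictions
y_true_toy = ['pos', 'pos', 'pos', 'neg', 'neg']
y_hat_toy = ['pos', 'pos', 'neg', 'pos', 'neg']

# Rows = actual class, columns = predicted class, in the order given by labels
cm_toy = confusion_matrix(y_true_toy, y_hat_toy, labels=['pos', 'neg'])

# With 'pos' listed first, the four cells unpack as follows
tp, fn = cm_toy[0]  # actual 'pos': predicted 'pos', predicted 'neg'
fp, tn = cm_toy[1]  # actual 'neg': predicted 'pos', predicted 'neg'

print(cm_toy)
print(tp, fn, fp, tn)
```

Without `labels=`, the classes are ordered by sorting the label values, so it is worth checking which class lands in row 0 before naming the cells.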

`print(classification_report(y_test, y_pred))` uses scikit-learn's `classification_report` function to produce a text summary of classification performance on the test set. It includes:

- Precision: The ratio of correctly predicted positive observations to all predicted positives, TP / (TP + FP). It measures how trustworthy positive predictions are.

- Recall (Sensitivity): The ratio of correctly predicted positive observations to all observations that actually belong to the class, TP / (TP + FN). It measures how many of the relevant cases the model captures.

- F1-Score: The harmonic mean of precision and recall, 2 × precision × recall / (precision + recall), which combines both into a single value.

- Support: The number of actual occurrences of the class in the dataset being evaluated.

- Accuracy: The ratio of correctly predicted observations to all observations. It measures overall correctness and is reported once for the whole test set rather than per class.

Precision, recall, F1-score, and support are reported per class, and the report also gives macro and weighted averages as an overall summary.
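The definitions above can be checked by hand on a small example. The sketch below uses made-up binary labels (1 = positive, 0 = negative; not the dataset's classes) and compares the manual formulas with scikit-learn's metric functions:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 4 actual positives, 6 actual negatives
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_hat_toy = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# Counts read off the two lists above
tp, fn, fp, tn = 3, 1, 2, 4

precision_manual = tp / (tp + fp)
recall_manual = tp / (tp + fn)
f1_manual = 2 * precision_manual * recall_manual / (precision_manual + recall_manual)
accuracy_manual = (tp + tn) / (tp + tn + fp + fn)

print(precision_manual, precision_score(y_true_toy, y_hat_toy))
print(recall_manual, recall_score(y_true_toy, y_hat_toy))
print(f1_manual, f1_score(y_true_toy, y_hat_toy))
print(accuracy_manual, accuracy_score(y_true_toy, y_hat_toy))
```

Each printed pair should agree, which confirms that the F1-score is the harmonic mean of precision and recall rather than a simple average.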

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

In [None]:
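# rows of cm are actual classes, columns are predicted classes;
# the first label (index 0) is treated as the positive class
TP = cm[0,0]
TN = cm[1,1]
FP = cm[1,0]
FN = cm[0,1]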

In [None]:
# print classification accuracy

classification_accuracy = (TP + TN) / float(TP + TN + FP + FN)

print('Classification accuracy : {0:0.4f}'.format(classification_accuracy))

In [None]:
# print classification error

classification_error = (FP + FN) / float(TP + TN + FP + FN)

print('Classification error : {0:0.4f}'.format(classification_error))

In [None]:
# print precision score

precision = TP / float(TP + FP)


print('Precision : {0:0.4f}'.format(precision))

In [None]:
recall = TP / float(TP + FN)

print('Recall or Sensitivity : {0:0.4f}'.format(recall))

In [None]:
true_positive_rate = TP / float(TP + FN)


print('True Positive Rate : {0:0.4f}'.format(true_positive_rate))

In [None]:
false_positive_rate = FP / float(FP + TN)


print('False Positive Rate : {0:0.4f}'.format(false_positive_rate))

In [None]:
specificity = TN / (TN + FP)

print('Specificity : {0:0.4f}'.format(specificity))

In [None]:
# print the first 10 predicted probabilities for the two classes (columns follow the order of gnb.classes_)

y_pred_prob = gnb.predict_proba(X_test)[0:10]

y_pred_prob

In [None]:
# store the probabilities in a dataframe, naming each column after its class in gnb.classes_

y_pred_prob_df = pd.DataFrame(data=y_pred_prob, columns=['Prob of {}'.format(label) for label in gnb.classes_])

y_pred_prob_df

In [None]:
# print the first 10 predicted probabilities for the class in column 1 (gnb.classes_[1])

gnb.predict_proba(X_test)[0:10, 1]

In [None]:
# store the predicted probabilities for the class in column 1 (gnb.classes_[1])

y_pred1 = gnb.predict_proba(X_test)[:, 1]

## Plotting histogram of predicted probabilities of Flexible learning:

In [None]:
# plot histogram of predicted probabilities


# adjust the font size 
plt.rcParams['font.size'] = 12


# plot histogram with 10 bins
plt.hist(y_pred1, bins = 10)


# set the title of predicted probabilities
plt.title('Histogram of predicted probabilities of Flexible learning')


# set the x-axis limit
plt.xlim(0,1)


# set the axis labels
plt.xlabel('Predicted probabilities of Flexible learning')
plt.ylabel('Frequency')

In [None]:
# plot ROC Curve

from sklearn.metrics import roc_curve

# y_pred1 holds the probabilities of gnb.classes_[1], so that class must be the positive label
fpr, tpr, thresholds = roc_curve(y_test, y_pred1, pos_label = gnb.classes_[1])

plt.figure(figsize=(6,4))

plt.plot(fpr, tpr, linewidth=2)

plt.plot([0,1], [0,1], 'k--' )

plt.rcParams['font.size'] = 12

plt.title('ROC curve for Gaussian Naive Bayes Classifier for Predicting Course Schedule')

plt.xlabel('False Positive Rate (1 - Specificity)')

plt.ylabel('True Positive Rate (Sensitivity)')

plt.show()

In [None]:
# compute ROC AUC

from sklearn.metrics import roc_auc_score

ROC_AUC = roc_auc_score(y_test, y_pred1)

print('ROC AUC : {:.4f}'.format(ROC_AUC))
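With string labels, the scores passed to the ROC functions have to belong to the class that is treated as positive. For binary targets, `roc_auc_score` takes the label that sorts last as positive, while `roc_curve` uses whatever `pos_label` names. The sketch below illustrates this on made-up labels and scores (the label strings and values are illustrative assumptions):

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical string labels and scores for the class that sorts last
y_true_toy = np.array(['Flexible', 'Flexible', 'Hands-on', 'Hands-on'])
classes_toy = np.unique(y_true_toy)  # sorted, like a fitted classifier's classes_
scores_second = np.array([0.1, 0.4, 0.35, 0.8])  # scores for classes_toy[1]

# roc_auc_score treats classes_toy[1] as the positive class
auc_direct = roc_auc_score(y_true_toy, scores_second)

# roc_curve agrees when pos_label matches the class the scores belong to
fpr_ok, tpr_ok, _ = roc_curve(y_true_toy, scores_second, pos_label=classes_toy[1])
auc_matching = auc(fpr_ok, tpr_ok)

# Pairing the same scores with the other label mirrors the curve
fpr_bad, tpr_bad, _ = roc_curve(y_true_toy, scores_second, pos_label=classes_toy[0])
auc_mismatched = auc(fpr_bad, tpr_bad)

print(auc_direct, auc_matching, auc_mismatched)
```

If the area from the curve and the value from `roc_auc_score` sum to 1 instead of matching, the positive label and the probability column are probably mismatched.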

In [None]:
# calculate cross-validated ROC AUC 

from sklearn.model_selection import cross_val_score

Cross_validated_ROC_AUC = cross_val_score(gnb, X_train, y_train, cv=5, scoring='roc_auc').mean()

print('Cross validated ROC AUC : {:.4f}'.format(Cross_validated_ROC_AUC))

In [None]:
# Applying 10-Fold Cross Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(gnb, X_train, y_train, cv = 10, scoring='accuracy')

print('Cross-validation scores:{}'.format(scores))

In [None]:
# compute Average cross-validation score

print('Average cross-validation score: {:.4f}'.format(scores.mean()))