# Coursera

##  Introduction
Tilto, Inc. is a Lithuanian company, registered in the United States. Tilto is planning to become a company like Coursera which partners with more than 200 leading universities and companies to bring flexible, affordable, job-relevant online learning to individuals and organizations worldwide. Tilto means bridge in Lithuanian and the company plans to become a bridge between citizens of the world and their potential as human beings. 

Coursera was founded by Daphne Koller and Andrew Ng in 2012 with a vision of providing life-transforming learning experiences to learners around the world. Today, Coursera is a global online learning platform that offers anyone, anywhere, access to online courses and degrees from leading universities and companies. 

Coursera offers a range of learning opportunities from hands-on projects and courses to job-ready certificates and degree programs. 82 million learners, 100+ Fortune 500 companies, and more than 6,000 campuses, businesses, and governments come to Coursera to access world-class learning, anytime, anywhere.
 
Coursera received B Corp certification in February 2021, which means that they have a legal duty not only to their shareholders, but to also make a positive impact on society and continue to reduce barriers to world-class education for all. 

Tilto has a vision to become an international company that impacts the world by helping its citizens realize their innate potential. The company wants to understand what has made Coursera successful and what has not worked, so that it can repeat the former and avoid the latter. Ultimately, Tilto wants to become a better company than Coursera. 

For this analysis, I will work with a dataset provided by Coursera and obtained from Kaggle to help Tilto, Inc. achieve its vision.

## Goals

The goal of this project is to analyze Coursera's current offering and advise the leadership of Tilto, Inc on how to move forward with their vision. This analysis will answer the following questions:

**Organizations**
1. Which learning organizations have been most successful with learners?

**Enrollment**
1. How many students can Tilto Inc project to attract?

**Certificates**
1. How does offering certificates affect learner satisfaction?
2. Which type of certificate is more beneficial?
3. Which type of certificate attracts more learners?

**Ratings**
1. Which courses do learners rate highest?
2. What factors lead a learner to give a course a higher rating?

**Difficulty**
1. Which difficulty level is most popular among learners?
2. How do difficulty level and certificate offerings interrelate?

**Improvements**
1. How can Tilto Inc become a better organization than Coursera?

## Importing Libraries and Loading Data

### Importing Libraries

In [59]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings  # Used later to suppress unnecessary FutureWarning messages.

### Loading Data in Pandas

The data is a CSV file that I downloaded from Kaggle. In this section, I create a pandas DataFrame so I can work with the data.

In [60]:
# Raw string so the backslashes in the Windows path are not treated as escape sequences.
coursera = pd.read_csv(r'C:\py\Projects\TuringCollege\Coursera\DataSet\coursera.csv', index_col = 0)
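
The absolute path above only resolves on the machine where the notebook was written. A minimal sketch of a more portable load follows; the helper name `load_coursera` is hypothetical, and the caller is assumed to pass wherever the CSV actually lives.

```python
from pathlib import Path

import pandas as pd


def load_coursera(csv_path):
    """Read the Coursera CSV, using its first (unnamed) column as the index."""
    csv_path = Path(csv_path)
    if not csv_path.exists():
        # Fail early with a readable message instead of a long pandas traceback.
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path, index_col=0)
```

Passing a path relative to the notebook would then let the same cell run on another machine without editing the code.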

## Basic Information

In this section, I will display the following information about this dataset:

1. Number of rows and columns
2. Total number of data entries
3. The first 5 rows
4. The data types in this dataset

### Number of Rows and Columns 

This dataset is made of 891 rows and 6 columns.

In [61]:
coursera.shape

(891, 6)

### Total Number of Entries 

This dataset contains 5,346 data entries (891 rows × 6 columns).

In [62]:
coursera.size

5346
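
The entry count is simply rows times columns. A small sketch on a made-up three-row frame (not the Coursera data) shows the relationship between `shape` and `size`:

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (made-up values).
toy = pd.DataFrame({"Title": ["A", "B", "C"], "Rating": [4.5, 4.7, 4.9]})

n_rows, n_cols = toy.shape
print(toy.size, n_rows * n_cols)  # both count every cell in the frame

# The same arithmetic gives the figure reported for the Coursera dataset.
print(891 * 6)
```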

### The First Five Rows

In [63]:
pd.set_option("display.max_columns", None) 
coursera.sort_index(inplace=True)
coursera.head()

,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
0,IBM Data Science,IBM,PROFESSIONAL CERTIFICATE,5,Beginner,480k
1,Introduction to Data Science,IBM,SPECIALIZATION,5,Beginner,310k
2,The Science of Well-Being,Yale University,COURSE,5,Mixed,2.5m
3,Python for Everybody,University of Michigan,SPECIALIZATION,5,Beginner,1.5m
4,Google IT Support,Google,PROFESSIONAL CERTIFICATE,5,Beginner,350k


### Data Types

Every column in this dataset is stored as a string (shown by pandas as object) except course_rating, which is a float.

In [64]:
coursera.dtypes

course_title                 object
course_organization          object
course_Certificate_type      object
course_rating               float64
course_difficulty            object
course_students_enrolled     object
dtype: object

## Data Cleaning

### Missing Values
This dataset has no missing values.

In [65]:
coursera.isnull().sum()

course_title                0
course_organization         0
course_Certificate_type     0
course_rating               0
course_difficulty           0
course_students_enrolled    0
dtype: int64

### Duplicate Values

This dataset has no duplicate values.

In [66]:
# Select every row that has an exact duplicate; the selection is empty here, so every sum is 0.
coursera[coursera.duplicated(keep = False)].sum()

course_title               0
course_organization        0
course_Certificate_type    0
course_rating              0
course_difficulty          0
course_students_enrolled   0
dtype: float64
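
Summing an empty selection works, but the result is easy to misread. A more direct check counts the duplicated rows themselves. The sketch below uses a made-up frame in which one row is repeated:

```python
import pandas as pd

# Toy data (made up): rows 0 and 2 are identical.
toy = pd.DataFrame({
    "Title": ["Course A", "Course B", "Course A"],
    "Rating": [4.5, 4.7, 4.5],
})

# duplicated() marks each repeat of an earlier row; summing the mask counts them.
n_duplicates = int(toy.duplicated().sum())
print(n_duplicates)

# keep=False marks every member of a duplicate group, which helps when inspecting them.
duplicate_rows = toy[toy.duplicated(keep=False)]
print(duplicate_rows)
```

On a dataset without duplicates, the count is 0 and the inspection frame is empty.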

### Modification of Column Names

I renamed each column to a shorter, clearer name.

In [67]:
coursera.rename(columns = {'course_title':'Title', 
                           'course_organization':'Organization', 
                           'course_Certificate_type':'Certificate',
                           'course_rating':'Rating',
                           'course_difficulty':'Difficulty',
                           'course_students_enrolled':'Enrollment'}, inplace = True)

In [68]:
coursera.head()

,Title,Organization,Certificate,Rating,Difficulty,Enrollment
0,IBM Data Science,IBM,PROFESSIONAL CERTIFICATE,5,Beginner,480k
1,Introduction to Data Science,IBM,SPECIALIZATION,5,Beginner,310k
2,The Science of Well-Being,Yale University,COURSE,5,Mixed,2.5m
3,Python for Everybody,University of Michigan,SPECIALIZATION,5,Beginner,1.5m
4,Google IT Support,Google,PROFESSIONAL CERTIFICATE,5,Beginner,350k


### Modification of Certificate Column

I modified the Certificate column so the text is in title case. I also removed the unnecessary text 'Certificate' from some of the entries. 

In [69]:
coursera['Certificate'] = coursera['Certificate'].str.title()
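# Remove the literal word 'Certificate', then strip the trailing space it leaves behind.
coursera['Certificate'] = coursera['Certificate'].str.replace('Certificate', '', regex = False).str.strip()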

In [70]:
coursera.head()

,Title,Organization,Certificate,Rating,Difficulty,Enrollment
0,IBM Data Science,IBM,Professional,5,Beginner,480k
1,Introduction to Data Science,IBM,Specialization,5,Beginner,310k
2,The Science of Well-Being,Yale University,Course,5,Mixed,2.5m
3,Python for Everybody,University of Michigan,Specialization,5,Beginner,1.5m
4,Google IT Support,Google,Professional,5,Beginner,350k
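
The cleaning steps can be checked on the three raw labels that appear in the dataset. This small sketch shows that removing the word 'Certificate' on its own leaves a trailing space after 'Professional', which a final strip removes:

```python
import pandas as pd

raw = pd.Series(["PROFESSIONAL CERTIFICATE", "SPECIALIZATION", "COURSE"])

# Title-case the labels, then drop the literal word "Certificate".
without_word = raw.str.title().str.replace("Certificate", "", regex=False)
print(without_word.tolist())

# Strip the space left behind after "Professional".
cleaned = without_word.str.strip()
print(cleaned.tolist())
```

Stripping matters later on, because 'Professional ' and 'Professional' would otherwise count as different groups in a groupby.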


### Modification of Enrollment Column

I modified this column so numbers appear as comma-separated integers. 

In [71]:
# Extracted the symbol 'k' or 'm' from the text in the Enrollment column. Created a new column called 'Symbol' for it.

coursera['Symbol'] = coursera['Enrollment'].str[-1:]

In [72]:
# Extracted the digits from the Enrollment column and converted them to float.

pd.set_option("display.max_rows", None)
coursera['Enrollment'] = coursera['Enrollment'].str.extract(r'(\d+[.\d]*)').astype(float)

In [73]:
# If the 'Symbol' column shows 'k', the number in the Enrollment column is multiplied by 1000.
# If it shows 'm', the number is multiplied by 1,000,000.

coursera.loc[coursera['Symbol'] == 'k', 'Multiple'] = 1000
coursera.loc[coursera['Symbol'] == 'm', 'Multiple'] = 1000000
coursera['Multiple'] = coursera['Multiple'].astype(int)

In [74]:
# I created a new column called 'Enrolled' holding the product of the multiplication explained above.
# Floats are displayed with a comma separator every 3 digits and no decimal places.
# Note that this display format applies to every float column, including Rating.

coursera['Enrolled'] = coursera['Enrollment'] * coursera['Multiple']
coursera['Enrolled'] = coursera['Enrolled'].astype(float)
pd.options.display.float_format = '{:,.0f}'.format

In [75]:
# Finally, I deleted the unnecessary columns after the operations above. 

coursera = coursera.drop(['Symbol', 'Multiple', 'Enrollment'], axis = 1)

In [76]:
coursera.head(10)

,Title,Organization,Certificate,Rating,Difficulty,Enrolled
0,IBM Data Science,IBM,Professional,5,Beginner,480000
1,Introduction to Data Science,IBM,Specialization,5,Beginner,310000
2,The Science of Well-Being,Yale University,Course,5,Mixed,2500000
3,Python for Everybody,University of Michigan,Specialization,5,Beginner,1500000
4,Google IT Support,Google,Professional,5,Beginner,350000
5,Deep Learning,deeplearning.ai,Specialization,5,Intermediate,690000
6,Machine Learning,Stanford University,Course,5,Mixed,3200000
7,Business Foundations,University of Pennsylvania,Specialization,5,Beginner,510000
8,Applied Data Science,IBM,Specialization,5,Beginner,220000
9,Cloud Engineering with Google Cloud,Google Cloud,Professional,5,Intermediate,310000
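
The conversion above uses three helper columns. A more compact alternative, sketched here under the assumption that every value ends in 'k' or 'm' or is a plain number, maps each string through one function. The names `MULTIPLIERS` and `parse_enrollment` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Suffixes seen in the Enrollment column and the factor each one stands for.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}


def parse_enrollment(text):
    """Convert strings such as '480k' or '2.5m' to an integer count."""
    text = text.strip().lower()
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        # round() guards against floating-point noise such as 5300.000000000001.
        return round(float(text[:-1]) * MULTIPLIERS[suffix])
    # Assumption: a value without a suffix is already a plain number.
    return round(float(text))


sample = pd.Series(["480k", "2.5m", "1.5m", "310k"])
enrolled = sample.map(parse_enrollment)
print(enrolled.tolist())
```

Returning integers also keeps the column out of reach of the global float display format, so other float columns such as Rating need not be rounded for the sake of Enrolled.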


## Descriptive Analysis

In this section, I provide basic summary statistics for the 2 numerical features of this dataset.

1.  **Rating:** The average learner rating of each course on Coursera's 5-point scale. In this dataset the ratings fall between roughly 3 and 5 (the summary below shows rounded values).
2.  **Enrolled:** The number of learners enrolled in each course. The values range from 1,500 to 3,200,000, with a median of 42,000.

In [78]:
coursera.describe()

,Rating,Enrolled
count,891,891
mean,5,90552
std,0,181936
min,3,1500
25%,5,17500
50%,5,42000
75%,5,99500
max,5,3200000
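
The Rating column shows a mean of 5 and a standard deviation of 0 because the display format was set earlier to zero decimal places, which rounds every float on output. A minimal sketch, using made-up ratings, of how to view such a summary with two decimals without changing the global setting:

```python
import pandas as pd

ratings = pd.Series([4.6, 4.8, 4.9, 3.3], name="Rating")  # made-up sample values

whole = "{:,.0f}".format   # the format set earlier in the notebook
two_dp = "{:,.2f}".format  # a format better suited to ratings

print(whole(ratings.mean()))   # rounded to a whole number
print(two_dp(ratings.mean()))  # two decimal places

# Temporarily switch the display format for one summary only.
with pd.option_context("display.float_format", two_dp):
    print(ratings.describe())
```

Outside the `with` block, pandas returns to whatever display format was active before.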


## Outliers

In [None]:
# Ignore the FutureWarning message that appears with the code below.

warnings.simplefilter(action = "ignore", category = FutureWarning) 

# Compute the interquartile range (IQR) of the two numeric columns.
numeric = coursera[['Rating', 'Enrolled']]
Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# Flag values more than 1.5 * IQR below Q1 or above Q3, then count them per column.
outliers_df = (numeric < (Q1 - 1.5 * IQR)) | (numeric > (Q3 + 1.5 * IQR))
outliers_df.sum()
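
The 1.5 × IQR rule used in this cell can be checked by hand on a small series. The sketch below wraps the rule in a hypothetical helper, `iqr_outliers`, and applies it to five made-up enrollment figures:

```python
import pandas as pd


def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up enrollment figures; the last one is far larger than the rest.
sample = pd.Series([1500, 17500, 42000, 99500, 3200000])
mask = iqr_outliers(sample)
print(sample[mask].tolist())
```

Here Q1 is 17,500 and Q3 is 99,500, so the IQR is 82,000 and the upper fence is 99,500 + 1.5 × 82,000 = 222,500. Only the largest value lies beyond it.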

In [None]:
# One box plot per numeric column of the Coursera dataset.
fig, axes = plt.subplots(1, 2, figsize=(20, 8))

sns.boxplot(ax=axes[0], data = coursera, x = 'Rating')
sns.boxplot(ax=axes[1], data = coursera, x = 'Enrolled')

## Exploratory Data Analysis (EDA)

### Organization vs Enrollment

In [83]:
organization_enrollment = coursera.groupby("Organization")["Enrolled"].sum()
organization_enrollment.sort_values(ascending=False, inplace = True)
organization_enrollment_df = pd.DataFrame(organization_enrollment)
organization_enrollment_df.head(10)

Organization,Enrolled
University of Michigan,7437700
University of Pennsylvania,5501300
Stanford University,4854000
"University of California, Irvine",4326000
Johns Hopkins University,4298900
Duke University,3967600
Yale University,3952000
IBM,2956400
deeplearning.ai,2863400
Google Cloud,2604300


### Rating vs Enrollment

In [84]:
rating_enrollment = coursera.groupby("Rating")["Enrolled"].sum()
rating_enrollment.sort_index(ascending=False, inplace = True)
rating_enrollment

Rating,Enrolled
5,3100
5,11639300
5,22335600
5,20574900
5,15783200
4,5962000
4,2782600
4,624000
4,637200
4,34000


### Organization vs Rating

In [None]:
ratings_organization = coursera.groupby("Organization")["Rating"].sum()
ratings_organization.sort_values(ascending=False, inplace = True)
ratings_organization.head(10)
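
Summing ratings favors organizations that simply offer many courses, so the total says little about how well each course is rated. A sketch on made-up rows, assuming the renamed columns Organization and Rating, compares the sum with the mean and the course count:

```python
import pandas as pd

# Made-up rows for illustration; column names follow the renamed dataset.
toy = pd.DataFrame({
    "Organization": ["Org A", "Org A", "Org A", "Org B"],
    "Rating": [4.0, 4.0, 4.0, 4.9],
})

summary = (
    toy.groupby("Organization")["Rating"]
       .agg(total="sum", average="mean", courses="count")
       .sort_values("average", ascending=False)
)
print(summary)
```

Org A has the larger total only because it has three courses; ranked by average rating, Org B comes first.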

### Difficulty vs Enrollment

In [None]:
difficulty_enrollment = coursera.groupby("Difficulty")["Enrolled"].sum()
difficulty_enrollment.sort_values(ascending=False, inplace = True)
difficulty_enrollment

### Certificate vs Enrollment

In [None]:
certificate_enrollment = coursera.groupby("Certificate")["Enrolled"].sum()
certificate_enrollment.sort_values(ascending=False, inplace = True)
certificate_enrollment

In [None]:
size = 25
pad = 25

params = {'legend.fontsize': 'large',
          'figure.figsize': (20,12),
          'axes.labelsize': size,
          'axes.titlesize': size,
          'xtick.labelsize': size*0.75,
          'ytick.labelsize': size*0.75,
          'axes.titlepad': pad,
          'axes.labelpad': pad,
          'font.family':'times new roman',
         }

plt.rcParams.update(params)

# Total enrollment per certificate type; plotting the raw rows would draw overlapping bars.
certificate_totals = coursera.groupby('Certificate')['Enrolled'].sum()
plt.xlabel('Certificate')
plt.ylabel('Enrolled')
plt.title('Number of Students Enrolled for Each Certificate Type')

plt.bar(certificate_totals.index, certificate_totals.values, width = 0.5, color = 'mediumseagreen')
plt.show()

In [None]:
coursera.head()