# 02Coursera

##  Introduction
Tilto, Inc. is a Lithuanian company registered in the United States. Tilto plans to become a company like Coursera, which partners with more than 200 leading universities and companies to bring flexible, affordable, job-relevant online learning to individuals and organizations worldwide. Tilto means "bridge" in Lithuanian, and the company plans to build a bridge between the citizens of the world and their human potential.

Coursera was founded by Daphne Koller and Andrew Ng in 2012 with a vision of providing life-transforming learning experiences to learners around the world. Today, Coursera is a global online learning platform that offers anyone, anywhere, access to online courses and degrees from leading universities and companies. 
 
Tilto has a vision similar to Coursera's. Tilto plans to become an international company that changes the world by helping its citizens realize their innate potential. Its leaders therefore wish to understand what makes Coursera successful. They also want to know what has not worked, so that Tilto can become an even better organization than Coursera.

For this analysis, I will work with a dataset provided by Coursera and obtained from Kaggle to help Tilto, Inc. achieve its vision.

## Goals

The goal of this project is to analyze Coursera's current offering and advise Tilto's leadership on how to move forward with their vision. This analysis will answer the following questions:

**Organizations**
1. What are the top ten organizations at Coursera?
2. How many course titles does each organization have?
3. What are the top ten course titles taught by each organization?

**Course**
1. How many courses are at Coursera?
2. What are the top ten course titles at Coursera?


**Difficulty**
1. What are the different difficulty levels at Coursera?
2. Which difficulty level is most popular among learners?
3. Which difficulty level attracts the fewest learners?

**Certificates**
1. What types of certificates are provided at Coursera?
2. What number of each type of certificate are offered?
3. How does offering certificates affect learner satisfaction?
4. Which type of certificate attracts the most learners?

**Enrollment**
1. How many students in total have enrolled at Coursera?
2. What is the correlation between the difficulty of a course and the number of enrolled students?

**Ratings**
1. Which are the top ten highest rated courses by students?
2. Which are the top ten highest rated organizations? 
3. Which are the bottom ten rated courses?
4. Which are the bottom ten rated organizations?

**Correlations**
1. What is the correlation between the difficulty level of a course and enrollment?
2. What is the correlation between difficulty level and rating?
3. What is the correlation between certificate type and enrollment?
4. What is the correlation between certificate type and rating?

**Improvements**
1. How can Tilto become a better organization than Coursera?

## Importing Libraries and Loading Data

### Importing Libraries

In [1]:
%matplotlib inline

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings # Suppresses unnecessary FutureWarning messages.

### Loading Data in Pandas

The data is a CSV file that I downloaded from Kaggle. In this section, I create a pandas DataFrame object so I can work with the data.

In [2]:
coursera = pd.read_csv(r'C:\py\Projects\TuringCollege\Coursera\DataSet\coursera.csv', index_col = 0, skipinitialspace = True)

## Basic Information

In this section, I will display the following information about this dataset:

1. Number of rows and columns
2. Total number of data entries
3. The first and last rows
4. The data types in this dataset

### Number of Rows and Columns 

This dataset is made of 891 rows and 6 columns.

In [3]:
coursera.shape

(891, 6)

### Total Number of Entries 

This dataset is made of 5346 data entries.

In [4]:
coursera.size

5346
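
The entry count is simply the number of rows multiplied by the number of columns. A minimal sketch of that relationship, using a hypothetical toy frame (`toy` is not part of the dataset):

```python
import pandas as pd

# Hypothetical 3-row, 2-column frame, used only to illustrate the relationship.
toy = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

rows, columns = toy.shape
# DataFrame.size is the number of cells, i.e. rows * columns.
entries = toy.size
print(rows, columns, entries)
```

For this dataset, the same arithmetic gives 891 × 6 = 5346.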

### The First and Last Rows

In [5]:
pd.set_option("display.max_columns", None)
coursera.sort_index(inplace=True)
coursera

Unnamed: 0,course_title,course_organization,course_Certificate_type,course_rating,course_difficulty,course_students_enrolled
0,IBM Data Science,IBM,PROFESSIONAL CERTIFICATE,4.6,Beginner,480k
1,Introduction to Data Science,IBM,SPECIALIZATION,4.6,Beginner,310k
2,The Science of Well-Being,Yale University,COURSE,4.9,Mixed,2.5m
3,Python for Everybody,University of Michigan,SPECIALIZATION,4.8,Beginner,1.5m
4,Google IT Support,Google,PROFESSIONAL CERTIFICATE,4.8,Beginner,350k
...,...,...,...,...,...,...
886,Understanding Modern Finance,American Institute of Business and Economics,SPECIALIZATION,4.4,Intermediate,11k
887,Object-Oriented Design,University of Alberta,COURSE,4.7,Intermediate,33k
888,Aprende a programar con Python,Universidad Austral,SPECIALIZATION,4.2,Beginner,6.6k
889,Погружение в Python,Moscow Institute of Physics and Technology,COURSE,4.7,Intermediate,45k


### Data Types

Every column in this dataset is stored as a string (pandas `object` dtype) except for course_rating, which is `float64`.

In [6]:
coursera.dtypes

course_title                 object
course_organization          object
course_Certificate_type      object
course_rating               float64
course_difficulty            object
course_students_enrolled     object
dtype: object

## Data Cleaning

### Missing Values
This dataset has no missing values.

In [7]:
coursera.isnull().sum()

course_title                0
course_organization         0
course_Certificate_type     0
course_rating               0
course_difficulty           0
course_students_enrolled    0
dtype: int64

### Duplicate Values

This dataset has no duplicate values.

In [8]:
coursera[coursera.duplicated(keep = False)].sum()

course_title                0.0
course_organization         0.0
course_Certificate_type     0.0
course_rating               0.0
course_difficulty           0.0
course_students_enrolled    0.0
dtype: float64
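
A more direct way to count duplicates is to sum the boolean mask that `duplicated()` returns. The sketch below runs on a hypothetical sample frame (`sample` is an assumption, not the real data), so the count it prints says nothing about the Coursera dataset itself:

```python
import pandas as pd

# Hypothetical sample with one fully duplicated row.
sample = pd.DataFrame({
    "Course": ["A", "B", "A"],
    "Rating": [4.5, 4.7, 4.5],
})

# duplicated() marks each repeat of an earlier row as True,
# so summing the mask gives the number of duplicate rows.
n_duplicates = int(sample.duplicated().sum())
print(n_duplicates)
```

With the default `keep='first'`, the first occurrence of a row is not counted as a duplicate.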

### Modification of Column Names

I renamed each column to make the names shorter and clearer.

In [9]:
coursera.rename(columns = {'course_title':'Course', 
                           'course_organization':'Organization', 
                           'course_Certificate_type':'Certificate',
                           'course_rating':'Rating',
                          'course_difficulty':'Difficulty',
                          'course_students_enrolled':'Students'}, inplace = True)

In [10]:
coursera.head()

Unnamed: 0,Course,Organization,Certificate,Rating,Difficulty,Students
0,IBM Data Science,IBM,PROFESSIONAL CERTIFICATE,4.6,Beginner,480k
1,Introduction to Data Science,IBM,SPECIALIZATION,4.6,Beginner,310k
2,The Science of Well-Being,Yale University,COURSE,4.9,Mixed,2.5m
3,Python for Everybody,University of Michigan,SPECIALIZATION,4.8,Beginner,1.5m
4,Google IT Support,Google,PROFESSIONAL CERTIFICATE,4.8,Beginner,350k


### Modification of Certificate Column

I modified the Certificate column so the text is in title case. I also removed the unnecessary text 'Certificate' from some of the entries.

In [11]:
# Makes each string into title case.
coursera['Certificate'] = coursera['Certificate'].str.title()

# Removes text 'Certificate' from any value in this column.
coursera['Certificate'] = coursera['Certificate'].str.replace(r'Certificate', '') 

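# Removes leading and trailing whitespace from the values of this column.
# str.strip() returns a new Series, so the result is assigned back to the column.
coursera['Certificate'] = coursera['Certificate'].str.strip()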

coursera.head(10)

Unnamed: 0,Course,Organization,Certificate,Rating,Difficulty,Students
0,IBM Data Science,IBM,Professional,4.6,Beginner,480k
1,Introduction to Data Science,IBM,Specialization,4.6,Beginner,310k
2,The Science of Well-Being,Yale University,Course,4.9,Mixed,2.5m
3,Python for Everybody,University of Michigan,Specialization,4.8,Beginner,1.5m
4,Google IT Support,Google,Professional,4.8,Beginner,350k
5,Deep Learning,deeplearning.ai,Specialization,4.8,Intermediate,690k
6,Machine Learning,Stanford University,Course,4.9,Mixed,3.2m
7,Business Foundations,University of Pennsylvania,Specialization,4.7,Beginner,510k
8,Applied Data Science,IBM,Specialization,4.6,Beginner,220k
9,Cloud Engineering with Google Cloud,Google Cloud,Professional,4.7,Intermediate,310k
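
The three cleaning steps can also be chained into one expression. A minimal sketch, assuming raw values in the same style as the original column (`raw` is a hypothetical sample Series):

```python
import pandas as pd

# Hypothetical raw values in the style of the original column.
raw = pd.Series(["PROFESSIONAL CERTIFICATE", "SPECIALIZATION", "COURSE"])

cleaned = (
    raw.str.title()                                   # 'Professional Certificate'
       .str.replace("Certificate", "", regex=False)   # 'Professional ' (trailing space)
       .str.strip()                                   # 'Professional'
)
print(cleaned.tolist())
```

Passing `regex=False` makes it explicit that 'Certificate' is treated as a literal string rather than a pattern.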


### Modification of Enrollment Column

I converted this column from abbreviated strings such as `480k` and `2.5m` into plain integers.

In [12]:
# Extracts the suffix 'k' or 'm' from the text in the 'Students' column and stores it in a new column called 'Symbol'.
coursera['Symbol'] = coursera['Students'].str[-1:]

In [13]:
# Extracts the digits from the 'Students' column and converts them to float.
pd.set_option("display.max_rows", None)
coursera['Students'] = coursera['Students'].str.extract(r'(\d+[.\d]*)').astype(float)

In [14]:
# If the 'Symbol' column shows 'k', the number in the 'Students' column is multiplied by 1,000.
# If it shows 'm', the number is multiplied by 1,000,000.
coursera.loc[coursera['Symbol'] == 'k', 'Multiple'] = 1000
coursera.loc[coursera['Symbol'] == 'm', 'Multiple'] = 1000000
coursera['Multiple'] = coursera['Multiple'].astype(int)

In [15]:
# Creates a new column called 'Enrollment' with the product of the multiplication above.
# Converts the column to integers.
coursera['Enrollment'] = coursera['Students'] * coursera['Multiple']
coursera['Enrollment'] = coursera['Enrollment'].astype(int)

In [16]:
# Deletes the unnecessary columns after the operations above.
coursera = coursera.drop(['Symbol', 'Multiple', 'Students'], axis = 1)

In [17]:
coursera.head(10)

Unnamed: 0,Course,Organization,Certificate,Rating,Difficulty,Enrollment
0,IBM Data Science,IBM,Professional,4.6,Beginner,480000
1,Introduction to Data Science,IBM,Specialization,4.6,Beginner,310000
2,The Science of Well-Being,Yale University,Course,4.9,Mixed,2500000
3,Python for Everybody,University of Michigan,Specialization,4.8,Beginner,1500000
4,Google IT Support,Google,Professional,4.8,Beginner,350000
5,Deep Learning,deeplearning.ai,Specialization,4.8,Intermediate,690000
6,Machine Learning,Stanford University,Course,4.9,Mixed,3200000
7,Business Foundations,University of Pennsylvania,Specialization,4.7,Beginner,510000
8,Applied Data Science,IBM,Specialization,4.6,Beginner,220000
9,Cloud Engineering with Google Cloud,Google Cloud,Professional,4.7,Intermediate,310000
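
The same conversion can be sketched as a single helper function. The names `MULTIPLIERS` and `parse_enrollment` are hypothetical, and the sketch assumes that every value ends in either `k` or `m`, as in the rows shown above:

```python
import pandas as pd

# Multipliers for the two suffixes assumed to occur; other suffixes raise a KeyError.
MULTIPLIERS = {"k": 1_000, "m": 1_000_000}

def parse_enrollment(text: str) -> int:
    """Convert a string such as '480k' or '2.5m' into an integer count."""
    text = text.strip().lower()
    number, suffix = text[:-1], text[-1]
    # round() guards against floating-point noise before the cast to int.
    return int(round(float(number) * MULTIPLIERS[suffix]))

# Sample values taken from the rows displayed in this notebook.
students = pd.Series(["480k", "2.5m", "6.6k"])
enrollment = students.map(parse_enrollment)
print(enrollment.tolist())
```

Rounding before the cast matters because `astype(int)` and `int()` truncate, so a product that lands slightly below a whole number would otherwise lose one student.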


### Modification of Order of Columns

I reordered the columns to make the dataset easier to read and work with.

In [18]:
coursera = coursera[['Organization', 'Course', 'Difficulty', 'Certificate', 'Enrollment', 'Rating']]
coursera.head(10)

Unnamed: 0,Organization,Course,Difficulty,Certificate,Enrollment,Rating
0,IBM,IBM Data Science,Beginner,Professional,480000,4.6
1,IBM,Introduction to Data Science,Beginner,Specialization,310000,4.6
2,Yale University,The Science of Well-Being,Mixed,Course,2500000,4.9
3,University of Michigan,Python for Everybody,Beginner,Specialization,1500000,4.8
4,Google,Google IT Support,Beginner,Professional,350000,4.8
5,deeplearning.ai,Deep Learning,Intermediate,Specialization,690000,4.8
6,Stanford University,Machine Learning,Mixed,Course,3200000,4.9
7,University of Pennsylvania,Business Foundations,Beginner,Specialization,510000,4.7
8,IBM,Applied Data Science,Beginner,Specialization,220000,4.6
9,Google Cloud,Cloud Engineering with Google Cloud,Intermediate,Professional,310000,4.7


## Descriptive Statistics

In this section, I provide basic statistics that summarize the two numerical features of this dataset.

1.  **Rating:** The subjective 1 - 5 score that students give at the end of a course based on various criteria.
2.  **Enrollment:** The number of students enrolled.

In [19]:
coursera.describe()

Unnamed: 0,Enrollment,Rating
count,891.0,891.0
mean,90552.08,4.677329
std,181936.5,0.162225
min,1500.0,3.3
25%,17500.0,4.6
50%,42000.0,4.7
75%,99500.0,4.8
max,3200000.0,5.0


## Exploratory Data Analysis (EDA)

### Total Enrollment

In this section, I will answer the following question:

1. According to this data, what is the total enrollment at Coursera?

In [89]:
coursera['Enrollment'].sum()

80681900

### Organizations

In this section, I will answer the following questions:

1. Which ten organizations have the highest enrollment?
2. Which ten organizations have the highest number of courses?

In [80]:
organizations = coursera.pivot_table('Enrollment', index = 'Organization', aggfunc='sum', 
                                    margins=True, margins_name = 'Total Enrollment')
organizations.sort_values('Enrollment', ascending = False).head(10)

Unnamed: 0_level_0,Enrollment
Organization,Unnamed: 1_level_1
Total Enrollment,80681900
University of Michigan,7437700
University of Pennsylvania,5501300
Stanford University,4854000
"University of California, Irvine",4326000
Johns Hopkins University,4298900
Duke University,3967600
Yale University,3952000
IBM,2956400
deeplearning.ai,2863400
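
Because the margin row `Total Enrollment` is sorted together with the organizations, it takes first place in the output above and leaves room for only nine organizations. A minimal sketch of one way to get exactly ten, assuming a frame with the same `Organization` and `Enrollment` columns (`sample` is a hypothetical stand-in for the real data):

```python
import pandas as pd

# Hypothetical sample; the real frame has the same two columns.
sample = pd.DataFrame({
    "Organization": ["IBM", "IBM", "Yale University", "Google"],
    "Enrollment": [480000, 310000, 2500000, 350000],
})

# groupby + sum gives per-organization totals without a margin row,
# so nlargest(n) returns exactly n organizations.
top = sample.groupby("Organization")["Enrollment"].sum().nlargest(2)
print(top)
```

On the full dataset, `nlargest(10)` would replace `nlargest(2)`.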


In [52]:
organizations = coursera.pivot_table('Course', index = 'Organization', aggfunc='count', dropna=True)
organizations.sort_values('Course', ascending = False).head(10)

Unnamed: 0_level_0,Course
Organization,Unnamed: 1_level_1
University of Pennsylvania,59
University of Michigan,41
Google Cloud,34
Duke University,28
Johns Hopkins University,28
"University of California, Irvine",27
University of Illinois at Urbana-Champaign,22
IBM,22
"University of California, Davis",21
University of Colorado Boulder,19


### Courses

In this section, I will answer the following questions:

1. How many courses are taught at Coursera?
2. What are the top ten course titles at Coursera?

In [64]:
coursera['Course'].nunique()

888
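
The dataset has 891 rows but only 888 distinct titles, so a few titles appear more than once. A minimal sketch of how the repeated titles could be listed, using a hypothetical sample frame (`sample`, `Org A`, and `Org B` are assumptions):

```python
import pandas as pd

# Hypothetical sample in which one title is offered by two organizations.
sample = pd.DataFrame({
    "Course": ["Machine Learning", "Machine Learning", "Deep Learning"],
    "Organization": ["Org A", "Org B", "Org C"],
})

# value_counts() counts how often each title occurs; keep those seen more than once.
title_counts = sample["Course"].value_counts()
repeated = title_counts[title_counts > 1]
print(repeated)
```

Repeated titles matter for the enrollment ranking, because summing by `Course` adds together the enrollment of every row that shares a title.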

In [84]:
courses = coursera.pivot_table('Enrollment', index = 'Course', aggfunc='sum')
courses.sort_values('Enrollment', ascending = False).head(10)

Unnamed: 0_level_0,Enrollment
Course,Unnamed: 1_level_1
Machine Learning,3490000
The Science of Well-Being,2500000
Python for Everybody,1500000
Programming for Everybody (Getting Started with Python),1300000
Data Science,830000
Career Success,790000
English for Career Development,760000
Successful Negotiation: Essential Strategies and Skills,750000
Data Science: Foundations using R,740000
Deep Learning,690000


### Difficulty Levels

In this section, I will answer the following questions:

1. What are the different difficulty levels at Coursera?
2. Which difficulty level is most popular among learners?

In [90]:
coursera['Difficulty'].unique().tolist()

['Beginner', 'Mixed', 'Intermediate', 'Advanced']

In [91]:
difficulty = coursera.pivot_table('Enrollment', index = 'Difficulty', aggfunc='sum', 
                                  margins=True, margins_name = 'Total Enrollment')
difficulty.sort_values('Enrollment', ascending = False).head(10)

Unnamed: 0_level_0,Enrollment
Difficulty,Unnamed: 1_level_1
Total Enrollment,80681900
Beginner,39921800
Mixed,24989400
Intermediate,14506300
Advanced,1264400


### Certificates

In this section, I will answer the following questions:

1. What types of certificates are provided at Coursera?
2. What is the total number of each type of certificate offered?

In [86]:
coursera['Certificate'].unique().tolist()

['Professional', 'Specialization', 'Course']

In [88]:
certificates = coursera.pivot_table('Enrollment', index = 'Certificate', aggfunc='sum', 
                                    margins=True, margins_name = 'Total Enrollment')
certificates.sort_values('Enrollment', ascending = False).head(10)

Unnamed: 0_level_0,Enrollment
Certificate,Unnamed: 1_level_1
Total Enrollment,80681900
Course,51131300
Specialization,27262200
Professional,2288400
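
The table above reports enrollment per certificate type. To also count how many offerings each type has, the count and the sum can be computed together. A minimal sketch on a hypothetical sample frame (`sample` is an assumption; the column names match this notebook):

```python
import pandas as pd

# Hypothetical sample with the same column names as the cleaned dataset.
sample = pd.DataFrame({
    "Certificate": ["Course", "Specialization", "Course", "Professional"],
    "Enrollment": [2500000, 310000, 45000, 480000],
})

# 'count' is the number of offerings per type; 'sum' is their total enrollment.
summary = sample.groupby("Certificate")["Enrollment"].agg(["count", "sum"])
print(summary)
```

Dividing `sum` by `count` would give the average enrollment per offering, which may be a fairer comparison when one type has far more offerings than the others.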


### Ratings

In this section, I will answer the following questions:

1. Which are the top ten highest rated courses by students?
2. Which are the ten lowest-rated courses?
3. Which are the ten highest-rated organizations?
4. Which are the ten lowest-rated organizations?

In [92]:
ratings = coursera.pivot_table('Rating', index = 'Course')
ratings.sort_values('Rating', ascending = False).head(10)

Unnamed: 0_level_0,Rating
Course,Unnamed: 1_level_1
Infectious Disease Modelling,5.0
El Abogado del Futuro: Legaltech y la Transformación Digital del Derecho,5.0
Stories of Infection,4.9
Boosting Creativity for Innovation,4.9
"Brand Management: Aligning Business, Brand and Behaviour",4.9
Understanding Einstein: The Special Theory of Relativity,4.9
Bugs 101: Insect-Human Interactions,4.9
Build a Modern Computer from First Principles: From Nand to Tetris (Project-Centered Course),4.9
Introduction to Psychology,4.9
Everyday Parenting: The ABCs of Child Rearing,4.9


In [97]:
ratings = coursera.pivot_table('Rating', index = 'Course')
ratings.sort_values('Rating', ascending = False).tail(10)

Unnamed: 0_level_0,Rating
Course,Unnamed: 1_level_1
Optical Engineering,4.2
Foundations of Marketing Analytics,4.2
Instructional Design Foundations and Applications,4.2
How to Start Your Own Business,4.1
"Introduction to Trading, Machine Learning & GCP",4.0
Mathematics for Machine Learning: PCA,4.0
iOS App Development with Swift,3.9
Machine Learning for Trading,3.9
Machine Learning and Reinforcement Learning in Finance,3.7
How To Create a Website in a Weekend! (Project-Centered Course),3.3


In [98]:
ratings = coursera.pivot_table('Rating', index = 'Organization')
ratings.sort_values('Rating', ascending = False).head(10)

Unnamed: 0_level_0,Rating
Organization,Unnamed: 1_level_1
Hebrew University of Jerusalem,4.9
"Nanyang Technological University, Singapore",4.9
Universidade Estadual de Campinas,4.9
Crece con Google,4.9
London Business School,4.9
Google - Spectrum Sharing,4.9
ScrumTrek,4.9
Universidade de São Paulo,4.866667
The University of Chicago,4.85
Universidad de los Andes,4.82


In [99]:
ratings = coursera.pivot_table('Rating', index = 'Organization')
ratings.sort_values('Rating', ascending = False).tail(10)

Unnamed: 0_level_0,Rating
Organization,Unnamed: 1_level_1
Peter the Great St. Petersburg Polytechnic University,4.4
American Institute of Business and Economics,4.4
Icahn School of Medicine at Mount Sinai,4.4
Novosibirsk State University,4.4
The Linux Foundation,4.4
Luther College at the University of Regina,4.4
Unity,4.35
New York Institute of Finance,4.3
Tsinghua University,4.3
The State University of New York,4.275
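
Several organizations in the tables above have a round average such as 4.9 or 4.4, which suggests that the average may rest on only one or two courses. A minimal sketch of how the ranking could be restricted to organizations with enough courses, assuming a hypothetical threshold `MIN_COURSES` and a hypothetical sample frame:

```python
import pandas as pd

# Hypothetical sample: Org A has a single course, Org B has three.
sample = pd.DataFrame({
    "Organization": ["Org A", "Org B", "Org B", "Org B"],
    "Rating": [4.9, 4.8, 4.6, 4.7],
})

# Computes the average rating and the number of courses per organization.
stats = sample.groupby("Organization")["Rating"].agg(["mean", "count"])

MIN_COURSES = 3  # hypothetical threshold, to be chosen by the analyst
reliable = stats[stats["count"] >= MIN_COURSES].sort_values("mean", ascending=False)
print(reliable)
```

Showing the `count` column next to the mean also lets a reader judge how much weight each average deserves.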


### Correlation Among Features 

In this section, I will answer the following questions:

1. What is the correlation between the difficulty level of a course and enrollment?
2. What is the correlation between difficulty level and rating?
3. What is the correlation between certificate type and enrollment?
4. What is the correlation between certificate type and rating?
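
Difficulty is a categorical feature, so it needs a numeric encoding before a correlation coefficient can be computed. The sketch below is one possible approach, not the method used in this notebook: it assumes the ordering Beginner < Intermediate < Advanced, leaves out `Mixed` because it has no natural position, and computes Spearman's rank correlation as the Pearson correlation of the ranks. `LEVELS` and `sample` are hypothetical.

```python
import pandas as pd

# Assumed ordering of difficulty levels; 'Mixed' is deliberately left out.
LEVELS = {"Beginner": 0, "Intermediate": 1, "Advanced": 2}

# Hypothetical sample with the same column names as the cleaned dataset.
sample = pd.DataFrame({
    "Difficulty": ["Beginner", "Intermediate", "Advanced", "Mixed", "Beginner"],
    "Enrollment": [480000, 45000, 11000, 2500000, 310000],
})

# Levels without an assumed position become NaN and are dropped.
level = sample["Difficulty"].map(LEVELS)
mask = level.notna()

# Spearman's rho equals the Pearson correlation of the ranked values.
rho = level[mask].rank().corr(sample.loc[mask, "Enrollment"].rank())
print(round(rho, 3))
```

A negative value would indicate that harder courses tend to have lower enrollment. Certificate type has no natural order, so comparing group means or medians is likely more appropriate there than a correlation coefficient.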

#### Difficulty vs Enrollment

In [None]:
# Sums enrollment per difficulty level and sorts from highest to lowest.
difficulty_enrollment = coursera.groupby("Difficulty")["Enrollment"].sum()
difficulty_enrollment.sort_values(ascending=False, inplace = True)
difficulty_enrollment_df = pd.DataFrame(difficulty_enrollment)
difficulty_enrollment_df.head(10)

#### Certificate vs Rating

#### Certificate vs Enrollment

In [None]:
# Sums enrollment per certificate type and sorts from highest to lowest.
certificate_enrollment = coursera.groupby("Certificate")["Enrollment"].sum()
certificate_enrollment.sort_values(ascending=False, inplace = True)
certificate_enrollment_df = pd.DataFrame(certificate_enrollment)
certificate_enrollment_df.head(10)

#### Organization vs Enrollment

In [None]:
# Sums enrollment per organization and sorts from highest to lowest.
organization_enrollment = coursera.groupby("Organization")["Enrollment"].sum()
organization_enrollment.sort_values(ascending=False, inplace = True)
organization_enrollment_df = pd.DataFrame(organization_enrollment)
organization_enrollment_df.head(10)

### What is the correlation between difficulty of a course and number of enrolled students?

#### Organization vs Rating

In [None]:
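# Counts the rated courses per organization. count() returns how many ratings exist, not their average.
ratings_organization = coursera.groupby("Organization")["Rating"].count()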
ratings_organization.sort_values(ascending=False, inplace = True)
ratings_organization_df = pd.DataFrame(ratings_organization)
ratings_organization_df.head(10)

#### Difficulty vs Rating

In [None]:
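# Counts the rated courses per difficulty level. count() returns how many ratings exist, not their average.
ratings_difficulty = coursera.groupby("Difficulty")["Rating"].count()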
ratings_difficulty.sort_values(ascending=False, inplace = True)
ratings_difficulty_df = pd.DataFrame(ratings_difficulty)
ratings_difficulty_df.head(10)

In [None]:
# Counts the course titles per organization. The title column is named 'Course' after renaming.
organization_title = coursera.groupby("Organization")["Course"].count()
organization_title.sort_values(ascending=False, inplace = True)
organization_title_df = pd.DataFrame(organization_title)
organization_title_df.head(10)

## Outliers

In [None]:
# Ignores FutureWarning messages that appear with the code below.

warnings.simplefilter(action = "ignore", category = FutureWarning) 

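# Selects the two numerical features of this dataset.
numeric = coursera[['Enrollment', 'Rating']]

Q1 = numeric.quantile(0.25)
Q3 = numeric.quantile(0.75)
IQR = Q3 - Q1

# Flags values more than 1.5 * IQR below Q1 or above Q3.
outliers_df = (numeric < (Q1 - 1.5 * IQR)) | (
    numeric > (Q3 + 1.5 * IQR)
)

# Counts the flagged values per column.
outliers_df.sum()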
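
The IQR rule used above flags a value as an outlier when it lies more than 1.5 × IQR below the first quartile or above the third quartile. A minimal sketch of the rule as a reusable function, where `iqr_outliers` is a hypothetical helper and `ratings` is a hypothetical sample:

```python
import pandas as pd

def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value falls outside the IQR fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical sample of ratings with one unusually low value.
ratings = pd.Series([4.6, 4.7, 4.7, 4.8, 4.8, 3.3])
mask = iqr_outliers(ratings)
print(ratings[mask].tolist())
```

The multiplier `k = 1.5` is the conventional choice; a larger value flags fewer points.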

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(20, 6))

# Draws one box plot per numerical feature of this dataset.
sns.boxplot(ax=axes[0], x=coursera['Enrollment'])
sns.boxplot(ax=axes[1], x=coursera['Rating'])