In [747]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings("ignore")

## Coursera

In [748]:
coursera_data_raw = pd.read_csv("../data/Coursera_courses.csv")
coursera_data_raw

Unnamed: 0,name,institution,course_url,course_id
0,Machine Learning,Stanford University,https://www.coursera.org/learn/machine-learning,machine-learning
1,Indigenous Canada,University of Alberta,https://www.coursera.org/learn/indigenous-canada,indigenous-canada
2,The Science of Well-Being,Yale University,https://www.coursera.org/learn/the-science-of-...,the-science-of-well-being
3,Technical Support Fundamentals,Google,https://www.coursera.org/learn/technical-suppo...,technical-support-fundamentals
4,Become a CBRS Certified Professional Installer...,Google - Spectrum Sharing,https://www.coursera.org/learn/google-cbrs-cpi...,google-cbrs-cpi-training
...,...,...,...,...
618,Accounting Data Analytics with Python,University of Illinois at Urbana-Champaign,https://www.coursera.org/learn/accounting-data...,accounting-data-analytics-python
619,Introduction to Molecular Spectroscopy,University of Manchester,https://www.coursera.org/learn/spectroscopy,spectroscopy
620,Managing as a Coach,"University of California, Davis",https://www.coursera.org/learn/managing-as-a-c...,managing-as-a-coach
621,The fundamentals of hotel distribution,ESSEC Business School,https://www.coursera.org/learn/hotel-distribution,hotel-distribution


First, the data type and the presence of null values are checked for each column. None of the 623 records contains a null value, so no handling of missing data is required for this dataset.

In [749]:
coursera_data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 623 entries, 0 to 622
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   name         623 non-null    object
 1   institution  623 non-null    object
 2   course_url   623 non-null    object
 3   course_id    623 non-null    object
dtypes: object(4)
memory usage: 19.6+ KB


In [750]:
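# Select the columns of interest and rename in a single step, so the rename
# produces a new DataFrame instead of modifying a slice of the raw data in place.
coursera_data = coursera_data_raw[['name', 'institution', 'course_id']].rename(columns={'name': 'course_name'})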
coursera_data.head()

Unnamed: 0,course_name,institution,course_id
0,Machine Learning,Stanford University,machine-learning
1,Indigenous Canada,University of Alberta,indigenous-canada
2,The Science of Well-Being,Yale University,the-science-of-well-being
3,Technical Support Fundamentals,Google,technical-support-fundamentals
4,Become a CBRS Certified Professional Installer...,Google - Spectrum Sharing,google-cbrs-cpi-training


It is worth noting how many institutions offer at least one course on Coursera.

In [751]:
coursera_data['institution'].nunique()

134
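
As a complement to the institution count, the sketch below shows one way to see how courses are distributed across institutions. The `sample` frame and its rows are hypothetical stand-ins for `coursera_data`, used only so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical sample standing in for coursera_data (not the real dataset).
sample = pd.DataFrame({
    "course_name": ["Course A", "Course B", "Course C", "Course D"],
    "institution": ["Stanford University", "Google", "Yale University", "Google"],
})

# Number of distinct institutions, as in the cell above.
n_institutions = sample["institution"].nunique()

# Courses per institution, most prolific first.
courses_per_institution = sample["institution"].value_counts()

print(n_institutions)
print(courses_per_institution.head())
```

On the real data, `coursera_data["institution"].value_counts()` would show whether the 134 institutions contribute evenly or a few dominate the catalogue.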

Finally, the dataset is checked for duplicate records.

In [752]:
coursera_data.duplicated().sum()

0

In [753]:
coursera_reviews_raw = pd.read_csv("../data/Coursera_reviews.csv")
coursera_reviews_raw

Unnamed: 0,reviews,reviewers,date_reviews,rating,course_id
0,"Pretty dry, but I was able to pass with just t...",By Robert S,"Feb 12, 2020",4,google-cbrs-cpi-training
1,would be a better experience if the video and ...,By Gabriel E R,"Sep 28, 2020",4,google-cbrs-cpi-training
2,Information was perfect! The program itself wa...,By Jacob D,"Apr 08, 2020",4,google-cbrs-cpi-training
3,A few grammatical mistakes on test made me do ...,By Dale B,"Feb 24, 2020",4,google-cbrs-cpi-training
4,Excellent course and the training provided was...,By Sean G,"Jun 18, 2020",4,google-cbrs-cpi-training
...,...,...,...,...,...
1454706,g,By Brijesh K,"Aug 25, 2020",5,computer-networking
1454707,.,By Vasavi V M,"Jul 02, 2020",5,computer-networking
1454708,.,By Drishti D,"Jun 20, 2020",5,computer-networking
1454709,.,By FAUSTINE F K,"Jun 07, 2020",5,computer-networking


Inspecting the dataset shows that the ``reviews`` column has some null values (153 of 1,454,711 rows), while the other columns have none. Because ``rating`` and ``course_id`` are always present, records without review text do not need to be dropped.

The reviews dataframe plays an important role in measuring course activity and ratings. Records with no text in ``reviews`` still contribute through the remaining columns, so the analysis can make use of all the available information.

In [754]:
coursera_reviews_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1454711 entries, 0 to 1454710
Data columns (total 5 columns):
 #   Column        Non-Null Count    Dtype 
---  ------        --------------    ----- 
 0   reviews       1454558 non-null  object
 1   reviewers     1454711 non-null  object
 2   date_reviews  1454711 non-null  object
 3   rating        1454711 non-null  int64 
 4   course_id     1454711 non-null  object
dtypes: int64(1), object(4)
memory usage: 55.5+ MB
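
The point made above, that rows without review text still carry a usable rating, can be sketched as follows. The `reviews_sample` frame is hypothetical, and the output column names (`n_ratings`, `n_text_reviews`, `mean_rating`) are illustrative choices rather than names used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same columns as the reviews data.
reviews_sample = pd.DataFrame({
    "reviews": ["Great", np.nan, "Too dry", np.nan],
    "rating": [5, 4, 3, 5],
    "course_id": ["a", "a", "b", "b"],
})

# "size" counts every row; "count" skips rows whose review text is missing.
course_activity = reviews_sample.groupby("course_id").agg(
    n_ratings=("rating", "size"),
    n_text_reviews=("reviews", "count"),
    mean_rating=("rating", "mean"),
)
print(course_activity)
```

Rows with a null ``reviews`` value still count toward `n_ratings` and `mean_rating`, which is why they are worth keeping.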


As for data types, every column has a suitable type except ``date_reviews``, which is stored as a string. It needs to be converted to a datetime type so that dates can be filtered, aggregated, and plotted correctly.

The duplicate check reveals an alarming 934,764 duplicated records, approximately 64% of the data. Left in place, this many duplicates would bias and distort the results of the analysis.

In [755]:
coursera_reviews_raw.duplicated().sum()

934764
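
With the figures reported here, the duplicate share is 934,764 / 1,454,711 ≈ 0.643. The sketch below shows how that share, and the rows responsible for it, could be computed. The `sample` frame is hypothetical and much smaller than the real data.

```python
import pandas as pd

# Hypothetical sample; the first three rows are identical.
sample = pd.DataFrame({
    "reviews": ["good", "good", "good", "bad", "ok"],
    "rating": [5, 5, 5, 1, 3],
    "course_id": ["a", "a", "a", "a", "b"],
})

# duplicated() flags every row that repeats an earlier row.
n_duplicates = sample.duplicated().sum()
duplicate_share = n_duplicates / len(sample)

# Distinct rows that appear more than once, with their frequencies.
repeated = sample.value_counts().loc[lambda s: s > 1]

print(n_duplicates, duplicate_share)
print(repeated)
```

Looking at `repeated` on the real data would help tell apart genuine repeated scraping from very short reviews (such as "." or "good") that happen to coincide.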

To keep the analysis focused, the reviewer names (``reviewers``) are not used: the column adds no information relevant to the goals of this analysis, so it is dropped.

Duplicate records are then removed so that every remaining row in the dataset is unique. Because ``reviewers`` is dropped first, rows that differed only in the reviewer name are also treated as duplicates.

In [756]:
coursera_reviews = coursera_reviews_raw.drop(['reviewers'], axis=1)
coursera_reviews = coursera_reviews.drop_duplicates()
coursera_reviews.shape[0]

519461

As noted earlier, the ``date_reviews`` column was not stored in a suitable format, so it is converted to the datetime type.

In [757]:
coursera_reviews['date_reviews'] = pd.to_datetime(coursera_reviews['date_reviews'], format="%b %d, %Y")
coursera_reviews.head(3)

Unnamed: 0,reviews,date_reviews,rating,course_id
0,"Pretty dry, but I was able to pass with just t...",2020-02-12,4,google-cbrs-cpi-training
1,would be a better experience if the video and ...,2020-09-28,4,google-cbrs-cpi-training
2,Information was perfect! The program itself wa...,2020-04-08,4,google-cbrs-cpi-training


In [758]:
coursera_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 519461 entries, 0 to 1454644
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   reviews       519400 non-null  object        
 1   date_reviews  519461 non-null  datetime64[ns]
 2   rating        519461 non-null  int64         
 3   course_id     519461 non-null  object        
dtypes: datetime64[ns](1), int64(1), object(2)
memory usage: 19.8+ MB
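
The conversion above uses a strict format, so it raises an error if any value does not match. A more defensive variant, sketched below on a hypothetical `raw_dates` series, uses `errors="coerce"` to turn unparseable values into `NaT` so they can be counted before deciding how to handle them. The month abbreviations are assumed to be in English, as in the data shown.

```python
import pandas as pd

# Hypothetical values; the last one deliberately does not match the format.
raw_dates = pd.Series(["Feb 12, 2020", "Sep 28, 2020", "not a date"])

# errors="coerce" yields NaT instead of raising on malformed input.
parsed = pd.to_datetime(raw_dates, format="%b %d, %Y", errors="coerce")
n_unparsed = parsed.isna().sum()

print(parsed)
print(n_unparsed)
```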


## edX

In [759]:
edx_data_raw = pd.read_csv("../data/edx_courses.csv")
edx_data_raw

Unnamed: 0,title,summary,n_enrolled,course_type,institution,instructors,Level,subject,language,subtitles,course_effort,course_length,price,course_description,course_syllabus,course_url
0,How to Learn Online,Learn essential strategies for successful onli...,124980,Self-paced on your time,edX,Nina Huntemann-Robyn Belair-Ben Piscopo,Introductory,Education & Teacher Training,English,English,2–3 hours per week,2 Weeks,FREE-Add a Verified Certificate for $49 USD,"Designed for those who are new to elearning, t...",Welcome - We start with opportunities to meet ...,https://www.edx.org/course/how-to-learn-online
1,Programming for Everybody (Getting Started wit...,"This course is a ""no prerequisite"" introductio...",293864,Self-paced on your time,The University of Michigan,Charles Severance,Introductory,Computer Science,English,English,2–4 hours per week,7 Weeks,FREE-Add a Verified Certificate for $49 USD,This course aims to teach everyone the basics ...,,https://www.edx.org/course/programming-for-eve...
2,CS50's Introduction to Computer Science,An introduction to the intellectual enterprise...,2442271,Self-paced on your time,Harvard University,David J. Malan-Doug Lloyd-Brian Yu,Introductory,Computer Science,English,English,6–18 hours per week,12 Weeks,FREE-Add a Verified Certificate for $90 USD,"This is CS50x , Harvard University's introduct...",,https://www.edx.org/course/cs50s-introduction-...
3,The Analytics Edge,"Through inspiring examples and stories, discov...",129555,Instructor-led on a course schedule,Massachusetts Institute of Technology,Dimitris Bertsimas-Allison O'Hair-John Silberh...,Intermediate,Data Analysis & Statistics,English,English,10–15 hours per week,13 Weeks,FREE-Add a Verified Certificate for $199 USD,"In the last decade, the amount of data availab...",,https://www.edx.org/course/the-analytics-edge
4,Marketing Analytics: Marketing Measurement Str...,This course is part of a MicroMasters® Program,81140,Self-paced on your time,"University of California, Berkeley",Stephan Sorger,Introductory,Computer Science,English,English,5–7 hours per week,4 Weeks,FREE-Add a Verified Certificate for $249 USD,Begin your journey in a new career in marketin...,,https://www.edx.org/course/marketing-analytics...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,Leaders in Citizen Security and Justice Manage...,"Learn about the latest in prevention, police a...",,Self-paced on your time,Inter-American Development Bank,Olga Espinoza-Eduardo Pazinato-Alejandra Mera-...,Intermediate,Social Sciences,English,English,4–5 hours per week,10 Weeks,FREE-Add a Verified Certificate for $25 USD,The high rates of crime and violence are two o...,,https://www.edx.org/course/leaders-in-citizen-...
971,Pattern Studying and Making | 图案审美与创作,Fantastic experiences in beauty and its repres...,,Self-paced on your time,Tsinghua University,Yuehua Nie,Introductory,Art & Culture,中文,"English, 中文",3–5 hours per week,12 Weeks,FREE-Add a Verified Certificate for $139 USD,Are you an original designer? Or a DIY fancier...,,https://www.edx.org/course/pattern-studying-an...
972,Computational Neuroscience: Neuronal Dynamics ...,This course explains the mathematical and comp...,11246,Self-paced on your time,École polytechnique fédérale de Lausanne,Wulfram Gerstner,Advanced,Biology & Life Sciences,English,English,4–6 hours per week,6 Weeks,FREE-Add a Verified Certificate for $139 USD,What happens in your brain when you make a dec...,Textbook: Neuronal Dynamics - from single neur...,https://www.edx.org/course/computational-neuro...
973,Cities and the Challenge of Sustainable Develo...,What is a sustainable city? Learn the basics h...,8775,Self-paced on your time,SDG Academy,Jeffrey D. Sachs,Introductory,Environmental Studies,English,English,1–2 hours per week,1 Weeks,FREE-Add a Verified Certificate for $25 USD,"According to the United Nations, urbanization ...",Module 1: Introduction to the SDGsProfessor Je...,https://www.edx.org/course/cities-and-the-chal...


In [760]:
edx_data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 975 entries, 0 to 974
Data columns (total 16 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   title               975 non-null    object
 1   summary             922 non-null    object
 2   n_enrolled          855 non-null    object
 3   course_type         975 non-null    object
 4   institution         975 non-null    object
 5   instructors         972 non-null    object
 6   Level               975 non-null    object
 7   subject             975 non-null    object
 8   language            975 non-null    object
 9   subtitles           972 non-null    object
 10  course_effort       975 non-null    object
 11  course_length       975 non-null    object
 12  price               975 non-null    object
 13  course_description  935 non-null    object
 14  course_syllabus     414 non-null    object
 15  course_url          975 non-null    object
dtypes: object(16)
memory usage

In the edX dataset, every column is stored as a string (``object`` dtype) and several contain null values. As with the previous dataset, the nulls in most columns are not significant for the analysis. The exception is ``n_enrolled``, the number of students enrolled in each course: its missing values call for caution, since they can affect the accuracy of any result based on enrollment counts.

In [761]:
edx_data_raw.duplicated().sum()

1

In [762]:
edx_data_raw.isna().sum()

title                   0
summary                53
n_enrolled            120
course_type             0
institution             0
instructors             3
Level                   0
subject                 0
language                0
subtitles               3
course_effort           0
course_length           0
price                   0
course_description     40
course_syllabus       561
course_url              0
dtype: int64

During the review of the dataset, some columns were found not to be relevant to the analysis, such as ``course_description`` and ``course_url``. They provide no information directly related to the aspects under study, so they are excluded.

In addition, ``course_syllabus`` contains a considerable number of null values (561 of 975 rows), which makes it hard to use. Only the ``summary`` column is therefore used to obtain information about course content.

Keeping only the most relevant columns, such as ``summary``, makes it possible to concentrate on the key aspects of the courses.

In [763]:
edx_data = edx_data_raw.drop(['course_description', 'course_syllabus', 'course_url'], axis=1)

A single duplicate record was identified in the dataset and is therefore removed.

In [765]:
edx_data = edx_data.drop_duplicates()
edx_data.shape[0]

974

To verify that all values in ``course_length`` are expressed in the same unit, namely weeks, the unique values of the column are inspected. This is practical because the dataset is relatively small.

In [766]:
edx_data['course_length'].unique()

array(['2 Weeks', '7 Weeks', '12 Weeks', '13 Weeks', '4 Weeks', '6 Weeks',
       '10 Weeks', '8 Weeks', '5 Weeks', '16 Weeks', '15 Weeks',
       '1 Weeks', '11 Weeks', '14 Weeks', '9 Weeks', '3 Weeks',
       '18 Weeks', '17 Weeks'], dtype=object)
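
For a larger dataset, reading the unique values by eye would not scale. A programmatic version of the same check might look like the sketch below, where `lengths` is a hypothetical sample of the column and the pattern assumes the `"<number> Weeks"` shape seen in the output above.

```python
import pandas as pd

# Hypothetical sample of course_length values.
lengths = pd.Series(["2 Weeks", "7 Weeks", "1 Weeks", "12 Weeks"])

# True only if every value is exactly "<digits> Weeks".
all_in_weeks = lengths.str.fullmatch(r"\d+ Weeks").all()

# Extract the numeric part once the unit has been confirmed.
weeks = lengths.str.extract(r"(\d+)", expand=False).astype(int)

print(all_in_weeks)
print(weeks.tolist())
```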

In [767]:
# Keep only the leading number of weeks, e.g. '12 Weeks' -> 12
edx_data['course_length'] = edx_data['course_length'].apply(lambda x: int(x.split()[0]))
# Strip thousands separators; courses with no reported enrollment are stored as 0
edx_data['n_enrolled'] = edx_data['n_enrolled'].str.replace(',', '').fillna('0').astype(int)
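
The cell above stores missing enrollment as 0, which makes those courses indistinguishable from courses with no students and pulls averages down. One alternative, sketched here and not the approach this notebook takes, is to keep them as missing using pandas' nullable `Int64` dtype. The `n_enrolled_raw` series is a hypothetical sample.

```python
import numpy as np
import pandas as pd

# Hypothetical sample of raw n_enrolled values, including a missing one.
n_enrolled_raw = pd.Series(["124,980", "2,442,271", np.nan, "8,775"])

# Remove thousands separators, convert to numbers, and keep NaN as <NA>.
n_enrolled = pd.to_numeric(
    n_enrolled_raw.str.replace(",", "", regex=False)
).astype("Int64")

n_missing = n_enrolled.isna().sum()
print(n_enrolled)
print(n_missing)
```

With this representation, aggregations such as `sum()` and `mean()` skip the missing entries by default instead of counting them as zero.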


In [768]:
edx_data['price'].unique()

array(['FREE-Add a Verified Certificate for $49 USD',
       'FREE-Add a Verified Certificate for $90 USD',
       'FREE-Add a Verified Certificate for $199 USD',
       'FREE-Add a Verified Certificate for $249 USD',
       'FREE-Add a Verified Certificate for $5 USD',
       'FREE-Add a Verified Certificate for $99 USD',
       'FREE-Add a Verified Certificate for $39 USD',
       'FREE-Add a Verified Certificate for $399 USD',
       'FREE-Add a Verified Certificate for $149 USD',
       'FREE-Add a Verified Certificate for $125 USD',
       'FREE-Add a Verified Certificate for $40 USD',
       'FREE-Add a Verified Certificate for $25 USD',
       'FREE-Add a Verified Certificate for $50 USD',
       'FREE-Add a Verified Certificate for $169 USD',
       'FREE-Add a Verified Certificate for $70 USD',
       'FREE-Add a Verified Certificate for $79 USD',
       'FREE-Add a Verified Certificate for $150 USD',
       'FREE-Add a Verified Certificate for $69 USD',
       'FREE-Add a Ver

In [769]:
edx_data['verified_certificate_price'] = edx_data['price'].apply(lambda x: float(x.split('$')[1].split()[0]))
edx_data['price'] = edx_data['price'].apply(lambda x: x.split('-')[0].capitalize())
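
The same split can also be written with vectorized string methods, which return NaN rather than raising if a value has no `$` amount. This is a sketch on a hypothetical `price` series; the regular expression assumes the `"FREE-Add a Verified Certificate for $N USD"` shape shown above.

```python
import pandas as pd

# Hypothetical sample of the raw price column.
price = pd.Series([
    "FREE-Add a Verified Certificate for $49 USD",
    "FREE-Add a Verified Certificate for $199 USD",
])

# Amount following the dollar sign, as a float.
certificate_price = price.str.extract(r"\$(\d+(?:\.\d+)?)", expand=False).astype(float)

# Text before the first hyphen, e.g. 'FREE' -> 'Free'.
base_price = price.str.split("-", n=1).str[0].str.capitalize()

print(certificate_price.tolist())
print(base_price.tolist())
```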

In [770]:
edx_data.head(3)

Unnamed: 0,title,summary,n_enrolled,course_type,institution,instructors,Level,subject,language,subtitles,course_effort,course_length,price,verified_certificate_price
0,How to Learn Online,Learn essential strategies for successful onli...,124980,Self-paced on your time,edX,Nina Huntemann-Robyn Belair-Ben Piscopo,Introductory,Education & Teacher Training,English,English,2–3 hours per week,2,Free,49.0
1,Programming for Everybody (Getting Started wit...,"This course is a ""no prerequisite"" introductio...",293864,Self-paced on your time,The University of Michigan,Charles Severance,Introductory,Computer Science,English,English,2–4 hours per week,7,Free,49.0
2,CS50's Introduction to Computer Science,An introduction to the intellectual enterprise...,2442271,Self-paced on your time,Harvard University,David J. Malan-Doug Lloyd-Brian Yu,Introductory,Computer Science,English,English,6–18 hours per week,12,Free,90.0


In [771]:
edx_data['course_effort'].unique()

array(['2–3 hours per week', '2–4 hours per week', '6–18 hours per week',
       '10–15 hours per week', '5–7 hours per week',
       '8–10 hours per week', '1–3 hours per week', '3–4 hours per week',
       '3–5 hours per week', '2–6 hours per week', '1–2 hours per week',
       '2–5 hours per week', '4–6 hours per week', '10–30 hours per week',
       '6–9 hours per week', '3–6 hours per week', '5–10 hours per week',
       '4–5 hours per week', '5–8 hours per week', '5–6 hours per week',
       '9–10 hours per week', '4–8 hours per week',
       '15–20 hours per week', '6–8 hours per week',
       '10–14 hours per week', '10–20 hours per week',
       '8–12 hours per week', '4–10 hours per week',
       '10–12 hours per week', '7–10 hours per week',
       '3–7 hours per week', '1–4 hours per week', '6–10 hours per week',
       '1–5 hours per week', '8–9 hours per week', '6–12 hours per week',
       '3–8 hours per week', '1–10 hours per week',
       '10–18 hours per week', '4–12 

In [772]:
def calculate_average(text):
    text = text.split()[0]
    values = text.split('–')
    min_value = int(values[0])
    max_value = int(values[1])
    return (min_value + max_value) / 2

edx_data['course_effort'] = edx_data['course_effort'].apply(calculate_average)
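
As a cross-check of `calculate_average`, the midpoint of each range can also be computed without a Python-level function. The sketch below uses a hypothetical `effort` series; the character class accepts both the en dash used in the data and a plain hyphen, which is an added assumption.

```python
import pandas as pd

# Hypothetical sample of course_effort values.
effort = pd.Series([
    "2–3 hours per week",
    "6–18 hours per week",
    "10–15 hours per week",
])

# Capture the lower and upper bound of each range as two columns.
bounds = effort.str.extract(r"(\d+)\s*[–-]\s*(\d+)").astype(float)

# Midpoint of the range, matching (min + max) / 2.
avg_effort = bounds.mean(axis=1)

print(avg_effort.tolist())
```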

In [773]:
edx_data['instructors'] = edx_data['instructors'].str.split('-')

In [774]:
edx_data['language'].unique()

array(['English', 'Español', 'Italiano', '日本語', 'Français', '中文',
       'Português', 'اللغة العربية', 'Deutsch'], dtype=object)

In [775]:
edx_data['subtitles'] = edx_data['subtitles'].str.split(r'\s*,\s*', regex=True)  # split on commas without keeping the surrounding spaces

In [776]:
edx_data.rename(columns={'title': 'course_name',
                         'summary': 'course_summary',
                         'n_enrolled': 'enrrollment_count',
                         'Level': 'course_level',
                         'subject': 'course_subject',
                         'course_effort': 'avg_course_effort',
                         'course_length': 'duration_in_weeks',
                         'price': 'course_price'}, inplace=True)

In [777]:
edx_data.head(3)

Unnamed: 0,course_name,course_summary,enrrollment_count,course_type,institution,instructors,course_level,course_subject,language,subtitles,avg_course_effort,duration_in_weeks,course_price,verified_certificate_price
0,How to Learn Online,Learn essential strategies for successful onli...,124980,Self-paced on your time,edX,"[Nina Huntemann, Robyn Belair, Ben Piscopo]",Introductory,Education & Teacher Training,English,[English],2.5,2,Free,49.0
1,Programming for Everybody (Getting Started wit...,"This course is a ""no prerequisite"" introductio...",293864,Self-paced on your time,The University of Michigan,[Charles Severance],Introductory,Computer Science,English,[English],3.0,7,Free,49.0
2,CS50's Introduction to Computer Science,An introduction to the intellectual enterprise...,2442271,Self-paced on your time,Harvard University,"[David J. Malan, Doug Lloyd, Brian Yu]",Introductory,Computer Science,English,[English],12.0,12,Free,90.0


In [778]:
edx_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 974 entries, 0 to 974
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   course_name                 974 non-null    object 
 1   course_summary              921 non-null    object 
 2   enrrollment_count           974 non-null    int64  
 3   course_type                 974 non-null    object 
 4   institution                 974 non-null    object 
 5   instructors                 971 non-null    object 
 6   course_level                974 non-null    object 
 7   course_subject              974 non-null    object 
 8   language                    974 non-null    object 
 9   subtitles                   971 non-null    object 
 10  avg_course_effort           974 non-null    float64
 11  duration_in_weeks           974 non-null    int64  
 12  course_price                974 non-null    object 
 13  verified_certificate_price  974 non-null

## Udemy

In [779]:
udemy_data_raw = pd.read_csv("../data/udemy_courses.csv")
udemy_data_raw

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance
...,...,...,...,...,...,...,...,...,...,...,...,...
3673,775618,Learn jQuery from Scratch - Master of JavaScri...,https://www.udemy.com/easy-jquery-for-beginner...,True,100,1040,14,21,All Levels,2.0,2016-06-14T17:36:46Z,Web Development
3674,1088178,How To Design A WordPress Website With No Codi...,https://www.udemy.com/how-to-make-a-wordpress-...,True,25,306,3,42,Beginner Level,3.5,2017-03-10T22:24:30Z,Web Development
3675,635248,Learn and Build using Polymer,https://www.udemy.com/learn-and-build-using-po...,True,40,513,169,48,All Levels,3.5,2015-12-30T16:41:42Z,Web Development
3676,905096,CSS Animations: Create Amazing Effects on Your...,https://www.udemy.com/css-animations-create-am...,True,50,300,31,38,All Levels,3.0,2016-08-11T19:06:15Z,Web Development


In [780]:
udemy_data_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3678 entries, 0 to 3677
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   course_id            3678 non-null   int64  
 1   course_title         3678 non-null   object 
 2   url                  3678 non-null   object 
 3   is_paid              3678 non-null   bool   
 4   price                3678 non-null   int64  
 5   num_subscribers      3678 non-null   int64  
 6   num_reviews          3678 non-null   int64  
 7   num_lectures         3678 non-null   int64  
 8   level                3678 non-null   object 
 9   content_duration     3678 non-null   float64
 10  published_timestamp  3678 non-null   object 
 11  subject              3678 non-null   object 
dtypes: bool(1), float64(1), int64(5), object(5)
memory usage: 319.8+ KB


In [781]:
udemy_data_raw.duplicated().sum()

6

In [782]:
udemy_data_raw.isna().sum()

course_id              0
course_title           0
url                    0
is_paid                0
price                  0
num_subscribers        0
num_reviews            0
num_lectures           0
level                  0
content_duration       0
published_timestamp    0
subject                0
dtype: int64

In [783]:
udemy_data = udemy_data_raw.drop(['url'], axis=1)
udemy_data = udemy_data.drop_duplicates()
udemy_data.shape[0]

3672

In [784]:
udemy_data['level'].unique()

array(['All Levels', 'Intermediate Level', 'Beginner Level',
       'Expert Level'], dtype=object)

In [785]:
udemy_data['subject'].unique()

array(['Business Finance', 'Graphic Design', 'Musical Instruments',
       'Web Development'], dtype=object)

In [786]:
udemy_data['published_timestamp'] = pd.to_datetime(udemy_data['published_timestamp'])

In [787]:
udemy_data.rename(columns = {'price': 'course_price', 
                             'level': 'course_level', 
                             'published_timestamp': 'published_date'}, inplace=True)

In [788]:
udemy_data.head(3)

Unnamed: 0,course_id,course_title,is_paid,course_price,num_subscribers,num_reviews,num_lectures,course_level,content_duration,published_date,subject
0,1070968,Ultimate Investment Banking Course,True,200,2147,23,51,All Levels,1.5,2017-01-18 20:58:58+00:00,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,True,75,2792,923,274,All Levels,39.0,2017-03-09 16:34:20+00:00,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19 19:26:30+00:00,Business Finance


In [789]:
udemy_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 3672 entries, 0 to 3677
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype              
---  ------            --------------  -----              
 0   course_id         3672 non-null   int64              
 1   course_title      3672 non-null   object             
 2   is_paid           3672 non-null   bool               
 3   course_price      3672 non-null   int64              
 4   num_subscribers   3672 non-null   int64              
 5   num_reviews       3672 non-null   int64              
 6   num_lectures      3672 non-null   int64              
 7   course_level      3672 non-null   object             
 8   content_duration  3672 non-null   float64            
 9   published_date    3672 non-null   datetime64[ns, UTC]
 10  subject           3672 non-null   object             
dtypes: bool(1), datetime64[ns, UTC](1), float64(1), int64(5), object(3)
memory usage: 319.1+ KB
