- [Introduction](#1) 
- [Content](#2)
- [Data Processing](#3)
    - [Importing the Required Libraries](#4)
    - [Loading the Data Into the DataFrame](#5)
    - [Checking the Types of Data](#6)
- [Data Cleaning](#11)
    - [Dropping Irrelevant Columns](#7)
    - [Renaming the Columns](#8)
    - [Missing Value](#9)
    - [Outlier Detection](#10)
- [Data Visualization and EDA](#12)

# Introduction <a id="1"></a>

Data analysis is the process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision making. It covers a wide variety of techniques, known under different names, across business, science, and social science fields. <br>
Data integration is a precursor to data analysis. Data analysis is closely related to data visualization and data dissemination. The term data analysis is sometimes used as a synonym for data modeling.

### Data Analysis Process
Data analysis is the process of obtaining raw data and turning it into information that users can base decisions on. Data is collected first and then analyzed to answer questions, test hypotheses, or disprove theories. <br>
Data analysis has several stages, and the stages are iterative: findings from a later stage can send the analyst back to an earlier one.

#### 🟢 Data Requirements
The data required as input to the analysis is selected based on the requirements of the analyst or the customers who will use the result of the analysis. Data can be numerical or categorical.

#### 🟢 Data Collecting
Data can be collected from a variety of sources: sensors in the environment such as traffic cameras, satellites, and recording devices, as well as interviews, downloads from online resources, or documentation.

#### 🟡 Data Processing
Data initially obtained must be processed or organized for analysis. For example, it can be placed into rows and columns in a table format, such as in a spreadsheet or statistical software, for further analysis.

#### 🔴 Data Cleaning
Data may be incomplete, duplicated, or contain errors. The need for data cleaning arises from problems in the way data is entered and stored. Data cleaning is the process of preventing and correcting these errors. Common tasks include record matching, identifying inaccurate data, assessing the overall quality of existing data, deduplication, and column segmentation. Such data problems can also be detected through a variety of analytical techniques: for example, values above or below certain thresholds can be flagged for review, and quantitative outlier-detection methods can be used to remove data that was likely entered incorrectly.
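
The cleaning tasks described above can be sketched on a tiny table. This is only an illustration: the rows, the `PRICE_LIMIT` threshold, and the variable names are made up for the example and do not come from the Udemy dataset.

```python
import pandas as pd

# Toy records invented for illustration; not taken from the Udemy dataset.
records = pd.DataFrame({
    "title": ["Course A", "Course A", "Course B", "Course C"],
    "price": [20.0, 20.0, None, 9999.0],
})

# Deduplication: keep the first copy of each fully identical row.
deduped = records.drop_duplicates().reset_index(drop=True)

# Completeness check: count missing values per column.
missing_per_column = deduped.isna().sum()

# Threshold check: flag prices above a hypothetical upper limit.
PRICE_LIMIT = 500.0  # assumed limit, chosen only for this example
suspicious = deduped[deduped["price"] > PRICE_LIMIT]

print(deduped)
print(missing_per_column)
print(suspicious)
```

Flagged rows are only candidates for review; whether they are errors still has to be decided from the data itself.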

#### 🔴 Exploratory Data Analysis (EDA)
Various mathematical formulas or models called algorithms can be applied to data to identify relationships between variables, such as correlation or causation. Inferential statistics includes techniques for measuring relationships between specific variables. <br>
Analysts can also build models that describe the data in order to simplify the analysis and communicate results.
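
As a minimal sketch of measuring a relationship between two variables, the Pearson correlation can be computed with pandas. The numbers below are invented for the example; the column names only mirror the ones used later in this notebook.

```python
import pandas as pd

# Toy values invented for illustration.
toy = pd.DataFrame({
    "subscribers": [100, 200, 300, 400, 500],
    "num_reviews": [10, 22, 29, 41, 50],
})

# Pearson correlation coefficient between the two columns (range -1 to 1).
corr = toy["subscribers"].corr(toy["num_reviews"])
print("Pearson correlation:", round(corr, 4))

# Full pairwise correlation matrix for all numeric columns.
print(toy.corr())
```

A coefficient close to 1 indicates a strong linear relationship, but it says nothing on its own about which variable, if either, causes the other.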

# Content <a id="2"></a>

Here, I have extracted data related to 10k courses which come under the development category on Udemy's website.
The 17 columns in the dataset can be used to gain insights related to:

- id : The course ID of that particular course.
- title : Shows the unique names of the courses available under the development category on Udemy.
- url: Gives the URL of the course.
- is_paid : Returns a boolean value displaying true if the course is paid and false if otherwise.
- num_subscribers : Shows the number of people who have subscribed to the course.
- avg_rating : Shows the average rating of the course.
- avg_rating_recent : Reflects the recent changes in the average rating.
- num_reviews : Gives the number of reviews that a course has received.
- num_published_lectures : Shows the number of lectures the course offers.
- num_published_practice_tests : Gives the number of practice tests that a course offers.
- created : The time of creation of the course.
- published_time : Time of publishing the course.
- discount_price__amount : The discounted price at which a course is being offered.
- discount_price__currency : The currency of the discounted price.
- price_detail__amount : The original price of a course.
- price_detail__currency : The currency of the original price.

# Data Processing <a id="3"></a>

## Importing the Required Libraries <a id="4"></a>

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
%matplotlib inline 
sns.set(color_codes=True)

import plotly.express as px  # needed for the px.line chart in the EDA section
import plotly.graph_objects as go
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

## Loading the Data Into the DataFrame <a id="5"></a>

In [None]:
path = "../input/finance-accounting-courses-udemy-13k-course/udemy_output_All_Finance__Accounting_p1_p626.csv"
data = pd.read_csv(path)
df = data.copy()

In [None]:
# to display the top 5 rows
data.head(5)

In [None]:
# to display the bottom 5 rows
data.tail(5) 

## Checking the Types of Data <a id="6"></a>

We check the data types because numeric fields, such as the price of a course, are sometimes stored as strings (object dtype). In that case the strings have to be converted to numbers before the data can be plotted. In this dataset the numeric columns are already stored as numbers, so there is nothing to convert.

In [None]:
data.dtypes
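
If a price column did arrive as text, a conversion along the following lines could be used. This is a sketch on made-up values, not a step this dataset requires; `pd.to_numeric` with `errors="coerce"` turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# Toy column invented for illustration: prices stored as text.
toy = pd.DataFrame({"price": ["1280", "8640", "not available"]})
print(toy.dtypes)

# Convert to numbers; entries that cannot be parsed become NaN.
toy["price"] = pd.to_numeric(toy["price"], errors="coerce")
print(toy.dtypes)
print(toy)
```

The NaN values produced this way can then be handled together with the other missing values.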

# Data Cleaning <a id="11"></a>

## Dropping Irrelevant Columns <a id="7"></a>

In [None]:
data.describe().T

In [None]:
data.info()

In [None]:
data.columns.to_list()

The three outputs above give us enough to start interpreting the data.
- The id, url, and num_published_practice_tests columns contain data that will not be useful for the EDA, so we drop them. The cell below also drops a few other columns, with the reasons listed underneath it.

In [None]:
data.drop(['id', 'url', 'num_published_practice_tests', "rating",
           "discount_price__price_string","price_detail__price_string",
          "discount_price__currency","price_detail__currency"], axis=1, inplace=True)
data.head(5)

"""
  id, url: will not be useful for comparing data
  num_published_practice_tests: Data is not consistent as it is used very little in general.
  rating: avg_rating_recent will give us the final rating value
  price_detail__price_string, discount_price__price_string: there are other columns holding the same values
  Since all data uses the same currency, we remove the columns holding currency.
"""

## Renaming the Columns <a id="8"></a>

The names of some columns are too long. We can shorten them.

In [None]:
data = data.rename(columns = {
    "is_paid": "paid",
    "num_subscribers": "subscribers",
    "is_wishlisted": "wishlisted",
    "num_published_lectures": "lectures",
    "published_time": "publish",
    "discount_price__amount": "dp_amount",
    "price_detail__amount": "pd_amount",
})

In [None]:
data.head(5)

## Missing Value <a id="9"></a>

In [None]:
# for each column: most frequent value, non-null count, and NaN count
for col in data.columns.to_list():
    print("Value: ", data[col].value_counts().index[0])
    print(col, ": ", data[col].value_counts().sum())
    print("Null: ", data[col].isnull().sum())
    print("="*45)
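
The same check can also be condensed into one summary table. The sketch below runs on a made-up frame whose column names mirror the renamed ones in this notebook; `summary` and `missing_pct` are names chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    "dp_amount": [455.0, np.nan, 455.0, np.nan],
    "pd_amount": [1280.0, 8640.0, np.nan, 3200.0],
    "title": ["A", "B", "C", "D"],
})

# One row per column: number and percentage of missing values.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "missing_pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

Sorting by the missing count puts the columns that need the most attention at the top.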

In [None]:
data[data["dp_amount"].isna()].head(5)

In [None]:
data[data["pd_amount"].isna()].head(5)

Let's interpret the data using the 3 outputs above. <br>
The discounted price column contains many NaN values. If the discounted prices varied widely, the missing values would be hard to estimate. But if the discounted prices are more or less the same, we can fill the gaps with the mean or, better, the median, which is less sensitive to extreme values. The same approach applies to the regular prices.

In [None]:
def get_three_m(col): # mean, median, mode
    values = data[col].dropna()  # ignore NaN when computing the statistics
    print("========",col,"========")
    print("Mean   :", values.mean())
    print("Median :", values.median())
    # Series.mode() returns all tied modes in ascending order; take the first
    print("Mode   :", values.mode().iloc[0])

    plt.figure(figsize=(15,5))
    ax = sns.countplot(x=col, data=data)
    plt.xticks(rotation = 90)
    plt.show()

In [None]:
get_three_m("dp_amount")
get_three_m("pd_amount")

Examining the results and graphs above, we can make the following observations:
- For discounted course prices: filling the NaN values with the median is sufficient.
- For non-discounted course prices: the data in this column is not evenly distributed. About 50 percent of the values are either 1280 or 8640, and the mode is 8640. In this case, I want a fill value that is close to both the mode and the median.

In [None]:
plt.style.use('fivethirtyeight')
# histplot with a KDE curve replaces the deprecated sns.distplot
sns.histplot(data["pd_amount"].dropna(), kde=True, stat="density", color='green')
plt.show()

In [None]:
dp_amount_nan_indexes = data[data["dp_amount"].isna()].index.to_list()
pd_amount_nan_indexes = data[data["pd_amount"].isna()].index.to_list()

data.loc[dp_amount_nan_indexes] = 455.0
data.loc[pd_amount_nan_indexes] = 3200.0
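
One caveat about the assignment in the cell above: `data.loc[rows] = value` with row labels only writes the value into every column of those rows, not just the price column. A column-scoped variant is sketched below on a made-up frame; the fill values 455.0 and 3200.0 are the ones used in this notebook, everything else is invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame invented for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "dp_amount": [455.0, np.nan, 520.0],
    "pd_amount": [1280.0, 8640.0, np.nan],
})

# Select rows with a boolean mask AND name the column to overwrite,
# so the other columns of those rows stay untouched.
dp_missing = toy["dp_amount"].isna()
toy.loc[dp_missing, "dp_amount"] = 455.0

pd_missing = toy["pd_amount"].isna()
toy.loc[pd_missing, "pd_amount"] = 3200.0

print(toy)
```

With the column label in place, titles, ratings, and the other fields of the filled rows keep their original values.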

In [None]:
print("dp_amount nan count :", data["dp_amount"].isnull().sum())
print("pd_amount nan count :", data["pd_amount"].isnull().sum())

In [None]:
plt.style.use('fivethirtyeight')
# same plot as before the fill, for comparison (histplot replaces the deprecated distplot)
sns.histplot(data["pd_amount"].dropna(), kde=True, stat="density", color='green')
plt.show()

## Outlier Detection <a id="10"></a>

In [None]:
plt.figure(figsize=(15,5))
sns.boxplot(x=data["subscribers"])

In [None]:
for col in data.select_dtypes('float64').columns:
    plt.figure(figsize=(15,5))
    plt.title(col)
    sns.boxplot(x=data[col])

In [None]:
for col in data.select_dtypes('int64').columns:
    plt.figure(figsize=(15,5))
    plt.title(col)
    sns.boxplot(x=data[col])

The graphs above show that this dataset is rich in outliers. <br>
For example, the avg_rating and avg_rating_recent columns can only take values between 0 and 5, so values such as 500 or 3000 are impossible. We can either delete these rows or look up the actual Udemy courses and correct the outlying values by hand.

In [None]:
print("for avg_rating: ", len(data[data['avg_rating'] > 5]))
print("for avg_rating_recent: ", len(data[data['avg_rating_recent'] > 5]))

As the counts show, fixing these rows manually would be very time consuming. Let's look at the content of the data.

In [None]:
data[data['avg_rating'] > 5]

In [None]:
data[data['avg_rating_recent'] > 5]

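As the output shows, these rows do not contain real course data, so the plan of correcting them by hand does not apply. Even the title field holds a number, which indicates that these rows are ghost records rather than genuine courses. This is consistent with the missing-value step, where `data.loc[...]` was given row labels only and the fill values 455.0 and 3200.0 were therefore written into every column of those rows. For this reason, we can safely delete these rows.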

In [None]:
data.drop(data[data['avg_rating'] > 5].index.to_list(), axis=0, inplace=True)
data.drop(data[data['avg_rating_recent'] > 5].index.to_list(), axis=0, inplace=True)
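
The two drops above can also be written as a single range filter that keeps only rows whose ratings lie between 0 and 5. The sketch below uses made-up rows; `rating_cols`, `in_range`, `removed`, and `clean` are names chosen for the example.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the Udemy dataset.
toy = pd.DataFrame({
    "title": ["Course A", "Course B", "455.0", "Course D"],
    "avg_rating": [4.5, 3.9, 455.0, 4.1],
    "avg_rating_recent": [4.4, 4.0, 455.0, 3200.0],
})

rating_cols = ["avg_rating", "avg_rating_recent"]

# A row is valid only if every rating column lies in the closed range [0, 5].
in_range = toy[rating_cols].apply(lambda s: s.between(0, 5)).all(axis=1)

removed = toy[~in_range]                       # rows that fail the check
clean = toy[in_range].reset_index(drop=True)   # rows that pass

print(removed)
print(clean)
```

Note that `between` evaluates to False for NaN, so this filter would also remove rows with a missing rating; whether that is desirable depends on the analysis.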

In [None]:
data

The paid column should hold a boolean value. Is that true for every row?

In [None]:
data['paid'].value_counts()

Sometimes a column turns out to be removable only after closer examination, and paid is one of them. Since every course has the value True, the column carries no information for the analysis, so it can be removed (the drop is left commented out below).

In [None]:
# data.drop(['paid'], axis=1, inplace=True)
# data.drop(['wishlisted'], axis=1, inplace=True)

In [None]:
data

In the box plot above for num_reviews, some courses have a very large number of reviews. This is natural for hit courses. Overall, the distribution is concentrated near 0, because new and little-known courses far outnumber the popular ones.

In [None]:
print("Max num_reviews : ", np.max(data['num_reviews'].to_list()))
print("Min num_reviews : ", np.min(data['num_reviews'].to_list()))
print("Mean num_reviews: ", np.mean(data['num_reviews'].to_list()))
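
For a distribution concentrated near 0 with a few very large values, the mean alone can be misleading. As a small sketch on invented review counts, comparing the mean with the median and the upper percentiles makes the skew visible, and `np.log1p` compresses the scale for plotting.

```python
import numpy as np
import pandas as pd

# Toy review counts invented for illustration.
reviews = pd.Series([0, 1, 2, 3, 5, 8, 12, 20, 150, 4000], name="num_reviews")

# Summary including upper percentiles, which expose the long right tail.
print(reviews.describe(percentiles=[0.5, 0.9, 0.99]))

mean_reviews = reviews.mean()
median_reviews = reviews.median()
print("Mean  :", mean_reviews)
print("Median:", median_reviews)
print("Skew  :", reviews.skew())

# log1p(x) = log(1 + x) handles zeros and compresses large values.
log_reviews = np.log1p(reviews)
print(log_reviews.round(2).tolist())
```

A mean far above the median, together with positive skewness, matches the pattern described above: many small courses and a handful of hits.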

The creation and publication dates are useful for finding out how long a course provider took to prepare a course. But the timestamps carry more detail than we need; the year, month, and day are sufficient. That is why we convert the dates.

In [None]:
data['created']  = pd.to_datetime(data['created'].to_list()).strftime('%m/%d/%Y').values
data['publish']  = pd.to_datetime(data['publish'].to_list()).strftime('%m/%d/%Y').values
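
One design point worth noting: once dates are stored as `%m/%d/%Y` text, sorting and grouping follow text order rather than calendar order. An alternative, sketched below on made-up timestamps (the ISO format with a trailing `Z` is an assumption about the raw data), keeps a datetime column truncated to the day and formats it only for display.

```python
import pandas as pd

# Toy timestamps invented for illustration; the format is an assumption.
toy = pd.DataFrame({
    "created": [
        "2019-12-01T10:15:30Z",
        "2020-01-15T08:00:00Z",
        "2019-02-20T23:59:59Z",
    ]
})

# Parse to datetime and drop the time of day, keeping a real datetime dtype.
toy["created"] = pd.to_datetime(toy["created"], utc=True).dt.normalize()

# Text form, used for display only.
as_text = toy["created"].dt.strftime("%m/%d/%Y")

chronological = toy.sort_values("created")  # calendar order
text_sorted = as_text.sort_values()         # text (lexicographic) order

print(chronological)
print(text_sorted)
```

In the text ordering, the January 2020 date comes before both 2019 dates, which is why the datetime column is the safer basis for time-based analysis.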

In [None]:
data

Let's examine the pd_amount and dp_amount columns.

In [None]:
# for pd_amount
print("Max pd_amount : ", np.max(data['pd_amount'].to_list()))
print("Min pd_amount : ", np.min(data['pd_amount'].to_list()))
print("Mean pd_amount: ", np.mean(data['pd_amount'].to_list()))

In [None]:
# for dp_amount
print("Max dp_amount : ", np.max(data['dp_amount'].to_list()))
print("Min dp_amount : ", np.min(data['dp_amount'].to_list()))
print("Mean dp_amount: ", np.mean(data['dp_amount'].to_list()))

There is no problem in the data, but I want to convert these columns as well: from Indian rupees to US dollars, a currency more widely used around the world.

In [None]:
# 1 Indian Rupee = 0.014 US Dollar
data['pd_amount'] = round(data['pd_amount']*0.014,2).to_list()
data['dp_amount'] = round(data['dp_amount']*0.014,2).to_list()
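
Because the cell above overwrites the columns in place, running it a second time would apply the rate twice. A variant that writes the converted values to a new column avoids this. The sketch below uses made-up amounts and the rate quoted in this notebook; `INR_TO_USD` and `pd_amount_usd` are names chosen for the example.

```python
import pandas as pd

INR_TO_USD = 0.014  # rate quoted in this notebook; real rates change over time

# Toy amounts invented for illustration.
toy = pd.DataFrame({"pd_amount": [1280.0, 3200.0, 8640.0]})

# Keep the rupee column and write the dollar value to a separate column,
# so rerunning the cell gives the same result.
toy["pd_amount_usd"] = (toy["pd_amount"] * INR_TO_USD).round(2)

print(toy)
```

Keeping both columns also makes it easy to check the conversion against the original amounts.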

In [None]:
data

# Data Visualization and EDA <a id="12"></a>

In [None]:
# Top 10 courses
data[['title', 'subscribers', 'avg_rating']] \
    .sort_values(by="subscribers", ascending=False)[0:10].set_index('title') \
    .style.format("{:.2f}", subset=['avg_rating']).background_gradient(cmap='Blues', subset = ['avg_rating']) \
    .set_caption('Most subscribed courses') \
    .set_properties(padding="15px", border='2px solid white', width='150px')

In [None]:
data['rating_diff'] = data.avg_rating_recent - data.avg_rating

In [None]:
data[data.subscribers > 10000][['title','subscribers', 'avg_rating', 'avg_rating_recent','rating_diff']] \
.sort_values(by = 'rating_diff')[:10] \
.set_index('title').style \
    .format("{:.4f}", subset = ['avg_rating', 'avg_rating_recent','rating_diff']) \
    .background_gradient(cmap='Blues', subset = ['subscribers']) \
    .bar(align='mid', color=['#FCC0CB', '#90EE90'], subset = ['rating_diff']) \
    .set_caption('Negative rating change') \
    .set_properties(padding="15px", border='2px solid white', width='150px')

In [None]:
fig = go.Figure(go.Bar(
            x=data.sort_values(by="subscribers", ascending=False).subscribers[0:10],
            y=data.sort_values(by="subscribers", ascending=False).title[0:10],
            orientation='h'))
fig.update_layout(yaxis=dict(autorange="reversed"), title='Top 10 most subscribed courses')

fig.show()

In [None]:
# lectures
df = data.sort_values(by="lectures", ascending=False)
fig = go.Figure(
    data=[go.Bar(
        x=df.lectures[0:10].to_list(), 
        y=df.title[0:10].to_list(), 
        orientation='h')],
)
fig.update_layout(yaxis=dict(autorange="reversed"), title='Top 10 courses with the most lectures')
fig.show()

In [None]:
data_date = data.groupby(['created']).size()[0:100]
fig = px.line(data_date, 
              x=data_date.index, y=data_date, line_shape = 'linear', title='Created courses', labels={'y': 'Courses'})
fig.update_layout(hovermode='x')
fig.update_xaxes(
    rangeslider_visible=True
)

fig.show()
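
Since `created` holds text dates at this point, the grouping above is ordered as text, and `[0:100]` takes the first 100 groups in that order rather than the earliest 100 days. As a sketch of a calendar-ordered alternative, courses can be counted per month from a datetime column. The dates below are invented for the example.

```python
import pandas as pd

# Toy creation dates invented for illustration.
toy = pd.DataFrame({
    "created": pd.to_datetime(
        ["2019-01-05", "2019-01-20", "2019-02-03", "2020-01-10"]
    )
})

# Count courses per calendar month; the period index sorts chronologically.
per_month = toy.groupby(toy["created"].dt.to_period("M")).size()

print(per_month)
```

The resulting series can be passed to a line chart in the same way as `data_date`, after converting the period index to strings or timestamps.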