<a href="https://colab.research.google.com/github/RANA1804/Introduction_to_machine_learning/blob/main/03_Understanding_the_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Understanding the Data

Exploring your data is a fundamental first step in any analysis. Python's libraries cover the whole workflow: pandas loads and summarizes the data, while Matplotlib and Seaborn visualize its patterns. This initial exploration sets the stage for later tasks such as handling missing values, transforming features, and running statistical analyses.

## Import Required Libraries

In [46]:
from google.colab import drive
drive.mount("/content/drive/")

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [47]:
import pandas as pd

## Read the Data
### Encoding Error
Character encoding is crucial when dealing with text data, as it defines the mapping between textual characters and the binary representations used by computers. The most common character encodings include UTF-8, UTF-16, and ISO-8859-1, among others.

When you read data into a pandas DataFrame from an external source, such as a CSV file, you might encounter different character encodings. In such cases, you can use the encoding parameter of pandas' read_csv() function to specify the encoding of the file.
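The effect of the encoding parameter can be reproduced without any file on disk. The sketch below is only an illustration: the two-line CSV and the variable names are made up, and it assumes the default encoding of read_csv() is UTF-8.

```python
import io

import pandas as pd

# Hypothetical two-line CSV, encoded as Latin-1 (ISO-8859-1).
# The byte 0xE9 is "é" in Latin-1 but is not valid on its own in UTF-8.
raw_bytes = "name\nCafé\n".encode("latin-1")

# Reading with the default encoding (UTF-8) fails on that byte.
try:
    pd.read_csv(io.BytesIO(raw_bytes))
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Passing the matching encoding decodes the text correctly.
demo = pd.read_csv(io.BytesIO(raw_bytes), encoding="latin-1")
print(utf8_failed)
print(demo)
```

Note that Latin-1 can decode any byte sequence without raising, so a successful read does not by itself prove that the chosen encoding is the right one.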

In [48]:
csv_path = r"/content/drive/MyDrive/1.ML/datasets/Global YouTube Statistics.csv"
# df = pd.read_csv(csv_path)  # This raises an error because of an encoding problem
df = pd.read_csv(csv_path, encoding = "latin-1")
df

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,2.280000e+11,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1.366418e+09,5.36,471031528.0,20.593684,78.962880
1,2,YouTube Movies,170000000,0.000000e+00,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,3.282395e+08,14.70,270663028.0,37.090240,-95.712891
2,3,MrBeast,166000000,2.836884e+10,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,3.282395e+08,14.70,270663028.0,37.090240,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,1.640000e+11,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,3.282395e+08,14.70,270663028.0,37.090240,-95.712891
4,5,SET India,159000000,1.480000e+11,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1.366418e+09,5.36,471031528.0,20.593684,78.962880
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,991,Natan por Aï¿,12300000,9.029610e+09,Sports,Natan por Aï¿,1200,Brazil,BR,Entertainment,...,700000.0,2017.0,Feb,12.0,51.3,2.125594e+08,12.08,183241641.0,-14.235004,-51.925280
991,992,Free Fire India Official,12300000,1.674410e+09,People & Blogs,Free Fire India Official,1500,India,IN,Games,...,300000.0,2018.0,Sep,14.0,28.1,1.366418e+09,5.36,471031528.0,20.593684,78.962880
992,993,Panda,12300000,2.214684e+09,,HybridPanda,2452,United Kingdom,GB,Games,...,1000.0,2006.0,Sep,11.0,60.0,6.683440e+07,3.85,55908316.0,55.378051,-3.435973
993,994,RobTopGames,12300000,3.741235e+08,Gaming,RobTopGames,39,Sweden,SE,Games,...,100000.0,2012.0,May,9.0,67.0,1.028545e+07,6.48,9021165.0,60.128161,18.643501


## How big is the data?

In [49]:
df.shape  # shape is an attribute of the DataFrame (no parentheses); it returns (rows, columns)

(995, 28)

In [50]:
type(df)

pandas.core.frame.DataFrame

pandas has two primary data structures: DataFrame (a two-dimensional table) and Series (a one-dimensional labeled array, such as a single column).
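A small sketch of how shape and type behave, using a made-up three-row table rather than the YouTube dataset: the whole table is a DataFrame, while selecting one column yields a Series.

```python
import pandas as pd

# Hypothetical miniature table, used only for illustration.
demo = pd.DataFrame({"channel": ["A", "B", "C"], "uploads": [10, 20, 30]})

# shape is an attribute (no parentheses): (rows, columns).
n_rows, n_cols = demo.shape

# The whole table is a DataFrame; a single column is a Series.
print(type(demo).__name__)             # DataFrame
print(type(demo["uploads"]).__name__)  # Series
print(n_rows, n_cols)                  # 3 2
```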

## What does the data look like?

In [51]:
df.head()  # By default, head() shows the first 5 rows.

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,1,T-Series,245000000,228000000000.0,Music,T-Series,20082,India,IN,Music,...,2000000.0,2006.0,Mar,13.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
1,2,YouTube Movies,170000000,0.0,Film & Animation,youtubemovies,1,United States,US,Games,...,,2006.0,Mar,5.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
2,3,MrBeast,166000000,28368840000.0,Entertainment,MrBeast,741,United States,US,Entertainment,...,8000000.0,2012.0,Feb,20.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
3,4,Cocomelon - Nursery Rhymes,162000000,164000000000.0,Education,Cocomelon - Nursery Rhymes,966,United States,US,Education,...,1000000.0,2006.0,Sep,1.0,88.2,328239500.0,14.7,270663028.0,37.09024,-95.712891
4,5,SET India,159000000,148000000000.0,Shows,SET India,116536,India,IN,Entertainment,...,1000000.0,2006.0,Sep,20.0,28.1,1366418000.0,5.36,471031528.0,20.593684,78.96288
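head() also accepts a row count, and tail() does the same from the end of the table. A minimal sketch on a made-up ten-row frame:

```python
import pandas as pd

# Hypothetical 10-row table, used only for illustration.
demo = pd.DataFrame({"rank": range(1, 11)})

first_three = demo.head(3)  # first 3 rows instead of the default 5
last_two = demo.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```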


## Randomly selecting rows

In [52]:
df.sample(5)

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
233,234,Las Ratitas,25200000,9602247000.0,People & Blogs,LAS RATITAS,4,,,People,...,,2019.0,May,31.0,,,,,,
796,797,Luli Pampï¿½,14000000,8623705000.0,Music,Luli Pampï¿½,294,Spain,ES,Music,...,100000.0,2012.0,Dec,22.0,88.9,47076781.0,13.96,37927409.0,40.463667,-3.74922
797,798,Gallina Pintadita,14000000,9660951000.0,Music,Gallina Pintadita,62,Mexico,MX,Film,...,,2011.0,Aug,2.0,40.2,126014024.0,3.42,102626859.0,23.634501,-102.552784
82,83,WorkpointOfficial,39000000,36131230000.0,Entertainment,WorkpointOfficial,72580,Thailand,TH,Entertainment,...,200000.0,2012.0,Nov,5.0,49.3,69625582.0,0.75,35294600.0,15.870032,100.992541
496,497,Jane ASMR ï¿½ï¿½,17700000,7387622000.0,,Jane ASMR ï¿½ï¿½,1888,South Korea,KR,People,...,,2012.0,Nov,17.0,94.3,51709098.0,4.15,42106719.0,35.907757,127.766922
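Because sample() draws rows at random, each run returns different rows. If a reproducible draw is needed, the random_state parameter can be fixed. The sketch below uses a made-up 100-row frame; the seed values are arbitrary.

```python
import pandas as pd

# Hypothetical 100-row table, used only for illustration.
demo = pd.DataFrame({"rank": range(1, 101)})

# The same seed gives the same rows on every run.
sample_a = demo.sample(5, random_state=42)
sample_b = demo.sample(5, random_state=42)

# frac draws a fraction of the rows instead of a fixed count.
ten_percent = demo.sample(frac=0.1, random_state=0)

print(sample_a.equals(sample_b))
print(len(ten_percent))
```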


## What are the data types of the columns?

In [53]:
df.info()  # info() is a method, so it is called with parentheses

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 28 columns):
 #   Column                                   Non-Null Count  Dtype  
---  ------                                   --------------  -----  
 0   rank                                     995 non-null    int64  
 1   Youtuber                                 995 non-null    object 
 2   subscribers                              995 non-null    int64  
 3   video views                              995 non-null    float64
 4   category                                 949 non-null    object 
 5   Title                                    995 non-null    object 
 6   uploads                                  995 non-null    int64  
 7   Country                                  873 non-null    object 
 8   Abbreviation                             873 non-null    object 
 9   channel_type                             965 non-null    object 
 10  video_views_rank                         994 non-n
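When only the column types are needed, the dtypes attribute and select_dtypes() give a more compact view than info(). A minimal sketch on a made-up frame whose column names merely echo the dataset:

```python
import pandas as pd

# Hypothetical miniature table, used only for illustration.
demo = pd.DataFrame({
    "Youtuber": ["A", "B"],
    "subscribers": [100, 200],
    "video views": [1.5, 2.5],
})

# One dtype per column.
print(demo.dtypes)

# Split the column names into numeric and non-numeric groups.
numeric_cols = demo.select_dtypes(include="number").columns.tolist()
other_cols = demo.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols)
print(other_cols)
```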

## Are there any null values?

In [54]:
df.isnull()  # Each cell is True where the value is null and False otherwise.

Unnamed: 0,rank,Youtuber,subscribers,video views,category,Title,uploads,Country,Abbreviation,channel_type,...,subscribers_for_last_30_days,created_year,created_month,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
0,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
990,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
991,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
992,False,False,False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
993,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


How many null values are there in each column?

In [55]:
df.isnull().sum()

rank                                         0
Youtuber                                     0
subscribers                                  0
video views                                  0
category                                    46
Title                                        0
uploads                                      0
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
lowest_monthly_earnings                      0
highest_monthly_earnings                     0
lowest_yearly_earnings                       0
highest_yearly_earnings                      0
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date 
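Raw counts are easier to judge next to the share of rows they represent. The sketch below, on a made-up four-row frame, shows one way to get both and to list only the columns that actually contain nulls.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with a few missing values.
demo = pd.DataFrame({
    "category": ["Music", None, "Gaming", None],
    "uploads": [10, 20, np.nan, 40],
    "rank": [1, 2, 3, 4],
})

# Number of nulls per column.
null_counts = demo.isnull().sum()

# Percentage of nulls per column (the mean of a boolean column is its share of True).
null_share = demo.isnull().mean() * 100

# Keep only the columns that contain nulls, largest count first.
cols_with_nulls = null_counts[null_counts > 0].sort_values(ascending=False)

print(null_counts)
print(null_share)
print(cols_with_nulls)
```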

## How does the data look mathematically?

In [56]:
df.describe()

Unnamed: 0,rank,subscribers,video views,uploads,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,...,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
count,995.0,995.0,995.0,995.0,994.0,879.0,962.0,939.0,995.0,995.0,...,995.0,658.0,990.0,990.0,872.0,872.0,872.0,872.0,872.0,872.0
mean,498.0,22982410.0,11039540000.0,9187.125628,554248.9,386.05347,745.719335,175610300.0,36886.148281,589807.8,...,7081814.0,349079.1,2012.630303,15.746465,63.627752,430387300.0,9.279278,224215000.0,26.632783,-14.128146
std,287.37606,17526110.0,14110840000.0,34151.352254,1362782.0,1232.244746,1944.386561,416378200.0,71858.724092,1148622.0,...,13797040.0,614355.4,4.512503,8.77752,26.106893,472794700.0,4.888354,154687400.0,20.560533,84.760809
min,1.0,12300000.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,...,0.0,1.0,1970.0,1.0,7.6,202506.0,0.75,35588.0,-38.416097,-172.104629
25%,249.5,14500000.0,4288145000.0,194.5,323.0,11.0,27.0,20137500.0,2700.0,43500.0,...,521750.0,100000.0,2009.0,8.0,36.3,83355410.0,5.27,55908320.0,20.593684,-95.712891
50%,498.0,17700000.0,7760820000.0,729.0,915.5,51.0,65.5,64085000.0,13300.0,212700.0,...,2600000.0,200000.0,2013.0,16.0,68.0,328239500.0,9.365,270663000.0,37.09024,-51.92528
75%,746.5,24600000.0,13554700000.0,2667.5,3584.5,123.0,139.75,168826500.0,37900.0,606800.0,...,7300000.0,400000.0,2016.0,23.0,88.2,328239500.0,14.7,270663000.0,37.09024,78.96288
max,995.0,245000000.0,228000000000.0,301308.0,4057944.0,7741.0,7741.0,6589000000.0,850900.0,13600000.0,...,163400000.0,8000000.0,2022.0,31.0,113.1,1397715000.0,14.72,842934000.0,61.92411,138.252924
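By default, describe() summarizes only the numeric columns of a DataFrame. Calling it on a text column gives a different summary (count, unique, top, freq). A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical miniature table, used only for illustration.
demo = pd.DataFrame({
    "category": ["Music", "Music", "Gaming", "Music"],
    "uploads": [10, 20, 30, 40],
})

# Numeric summary: count, mean, std, min, quartiles, max.
numeric_summary = demo.describe()

# Summary of a text column: count, unique, top (most frequent value), freq.
text_summary = demo["category"].describe()

print(numeric_summary)
print(text_summary)
```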


## Are there any duplicated rows?

In [57]:
df.duplicated()

0      False
1      False
2      False
3      False
4      False
       ...  
990    False
991    False
992    False
993    False
994    False
Length: 995, dtype: bool

In [58]:
df.duplicated().sum()

0
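The dataset above has no duplicated rows, so the sketch below uses a made-up frame that does. It shows how duplicates can be counted on all columns or on a subset, and how drop_duplicates() removes them.

```python
import pandas as pd

# Hypothetical miniature table; row 2 repeats row 0 exactly.
demo = pd.DataFrame({
    "Youtuber": ["A", "B", "A", "C"],
    "Country": ["IN", "US", "IN", "US"],
})

# Rows that repeat an earlier row across all columns.
n_full_dupes = int(demo.duplicated().sum())

# Rows that repeat an earlier value in the Country column only.
n_country_dupes = int(demo.duplicated(subset=["Country"]).sum())

# Keep the first occurrence of each fully duplicated row.
deduped = demo.drop_duplicates()

print(n_full_dupes, n_country_dupes, len(deduped))
```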

## What is the correlation between the columns?
The df.corr() method in pandas calculates the correlation matrix for a DataFrame. The correlation matrix is a table of correlation coefficients between pairs of variables. Each coefficient ranges from -1 to 1 and quantifies the degree to which two variables are linearly related.
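By default, corr() uses the Pearson coefficient. For two columns $x$ and $y$ with $n$ values each and means $\bar{x}$ and $\bar{y}$, it is defined as:

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

A value near 1 means the columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.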

In [59]:
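# Extract the correlation between all the numeric variables.
# numeric_only=True skips text columns; recent pandas versions raise an error without it.
df.corr(numeric_only=True)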


Unnamed: 0,rank,subscribers,video views,uploads,video_views_rank,country_rank,channel_type_rank,video_views_for_the_last_30_days,lowest_monthly_earnings,highest_monthly_earnings,...,highest_yearly_earnings,subscribers_for_last_30_days,created_year,created_date,Gross tertiary education enrollment (%),Population,Unemployment rate,Urban_population,Latitude,Longitude
rank,1.0,-0.640608,-0.453363,-0.051036,-0.059455,0.016776,-0.029554,-0.186339,-0.248394,-0.24805,...,-0.248392,-0.188571,0.106025,-0.006256,-0.037491,-0.025475,-0.01486,-0.038807,3.6e-05,0.019003
subscribers,-0.640608,1.0,0.750958,0.077136,0.057202,0.032683,0.027393,0.278846,0.388941,0.388579,...,0.388935,0.309527,-0.141827,-0.011836,-0.006804,0.082219,-0.008251,0.083521,0.01945,0.022443
video views,-0.453363,0.750958,1.0,0.165928,-0.061807,-0.068277,-0.050194,0.361856,0.552096,0.551455,...,0.552091,0.187384,-0.127068,-0.03818,-0.015232,0.080214,-0.000729,0.076649,0.037334,0.031268
uploads,-0.051036,0.077136,0.165928,1.0,-0.108988,-0.078394,-0.09845,0.101521,0.166922,0.167283,...,0.166904,0.008933,-0.154904,0.0349,-0.218396,0.143122,-0.188101,0.072807,-0.067868,0.233169
video_views_rank,-0.059455,0.057202,-0.061807,-0.108988,1.0,0.877504,0.949936,-0.067193,-0.208863,-0.208935,...,-0.208851,-0.167295,0.006671,0.031231,0.046934,-0.103178,-0.029276,-0.122747,0.015932,-0.016492
country_rank,0.016776,0.032683,-0.068277,-0.078394,0.877504,1.0,0.898442,-0.098737,-0.148947,-0.14896,...,-0.148946,-0.126175,-0.037807,-0.012699,0.10329,-0.053181,0.066697,-0.024578,0.048323,-0.072476
channel_type_rank,-0.029554,0.027393,-0.050194,-0.09845,0.949936,0.898442,1.0,-0.129051,-0.187908,-0.18797,...,-0.187896,-0.154021,-0.014002,0.038299,0.062484,-0.116254,0.003697,-0.123852,0.010195,-0.055144
video_views_for_the_last_30_days,-0.186339,0.278846,0.361856,0.101521,-0.067193,-0.098737,-0.129051,1.0,0.68033,0.680289,...,0.68033,0.451523,0.053123,-0.01367,-0.03561,0.053859,-0.002323,0.051126,-0.026864,0.049033
lowest_monthly_earnings,-0.248394,0.388941,0.552096,0.166922,-0.208863,-0.148947,-0.187908,0.68033,1.0,0.999955,...,0.999998,0.67936,0.072316,-0.040269,-0.06219,0.104812,-0.042874,0.081206,0.006583,0.100379
highest_monthly_earnings,-0.24805,0.388579,0.551455,0.167283,-0.208935,-0.14896,-0.18797,0.680289,0.999955,1.0,...,0.999953,0.679699,0.072289,-0.039959,-0.061973,0.104785,-0.042627,0.081226,0.006873,0.100299


## Correlation of a particular column with all the other columns

In [60]:
df.corr(numeric_only=True)["subscribers"]


rank                                      -0.640608
subscribers                                1.000000
video views                                0.750958
uploads                                    0.077136
video_views_rank                           0.057202
country_rank                               0.032683
channel_type_rank                          0.027393
video_views_for_the_last_30_days           0.278846
lowest_monthly_earnings                    0.388941
highest_monthly_earnings                   0.388579
lowest_yearly_earnings                     0.389072
highest_yearly_earnings                    0.388935
subscribers_for_last_30_days               0.309527
created_year                              -0.141827
created_date                              -0.011836
Gross tertiary education enrollment (%)   -0.006804
Population                                 0.082219
Unemployment rate                         -0.008251
Urban_population                           0.083521
Latitude    
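A long column of coefficients is easier to read once it is sorted and the trivial self-correlation is dropped. The sketch below uses a made-up frame and assumes pandas 1.5 or later, where corr() accepts numeric_only.

```python
import pandas as pd

# Hypothetical miniature table; "name" is text and is skipped by numeric_only=True.
demo = pd.DataFrame({
    "subscribers": [10, 20, 30, 40],
    "views": [12, 18, 33, 41],
    "rank": [4, 3, 2, 1],
    "name": ["a", "b", "c", "d"],
})

corr_with_subs = (
    demo.corr(numeric_only=True)["subscribers"]
    .drop("subscribers")           # remove the self-correlation of 1.0
    .sort_values(ascending=False)  # strongest positive correlation first
)
print(corr_with_subs)
```

Sorting by the absolute value instead would rank strong negative correlations, such as rank here, alongside strong positive ones.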

## Creating a DataFrame and calculating its correlation matrix

In [61]:
data = {"a" : [2,4,6,5],
        "b" : [2,4,7,8],
        "c" : [5,4,5,7]}

In [62]:
df = pd.DataFrame(data)
df

Unnamed: 0,a,b,c
0,2,2,5
1,4,4,4
2,6,7,5
3,5,8,7


In [63]:
column_matrix = df.corr()
column_matrix

Unnamed: 0,a,b,c
a,1.0,0.903682,0.271448
b,0.903682,1.0,0.649331
c,0.271448,0.649331,1.0


In [64]:
df.corr()["a"]

a    1.000000
b    0.903682
c    0.271448
Name: a, dtype: float64
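The coefficient 0.903682 between columns a and b can be checked by hand. The sketch below recomputes it from the Pearson definition (the sum of products of deviations divided by the square root of the product of the summed squared deviations) and compares it with the pandas result. The variable names are chosen for this illustration only.

```python
import numpy as np
import pandas as pd

# Columns a and b from the small example above.
demo = pd.DataFrame({"a": [2, 4, 6, 5], "b": [2, 4, 7, 8]})

# Deviations of each value from its column mean.
dev_a = demo["a"] - demo["a"].mean()
dev_b = demo["b"] - demo["b"].mean()

# Pearson coefficient computed directly from the definition.
r_manual = (dev_a * dev_b).sum() / np.sqrt((dev_a**2).sum() * (dev_b**2).sum())

# The same coefficient as computed by pandas.
r_pandas = demo["a"].corr(demo["b"])

print(round(r_manual, 6))  # 0.903682
print(round(r_pandas, 6))  # 0.903682
```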