# **TikTok Project**
**Course 2 - Get Started with Python**

Welcome to the TikTok Project!

You have just started as a data professional at TikTok.

The team is still in the early stages of the project. You have received notice that TikTok's leadership team has approved the project proposal. To gain clear insights to prepare for a claims classification model, TikTok's provided data must be examined to begin the process of exploratory data analysis (EDA).

A notebook was structured and prepared to help you in this project. Please complete the following questions.

# **Course 2 End-of-course project: Inspect and analyze data**

In this activity, you will examine data provided and prepare it for analysis.
<br/>

**The purpose** of this project is to investigate and understand the data provided. This activity will:

1.   Acquaint you with the data

2.   Compile summary information about the data

3.   Begin the process of EDA and reveal insights contained in the data

4.   Prepare you for more in-depth EDA, hypothesis testing, and statistical analysis

**The goal** is to construct a dataframe in Python, perform a cursory inspection of the provided dataset, and inform TikTok data team members of your findings.
<br/>
*This activity has three parts:*

**Part 1:** Understand the situation
* How can you best prepare to understand and organize the provided TikTok information?

**Part 2:** Understand the data

* Create a pandas dataframe for data learning and future exploratory data analysis (EDA) and statistical activities

* Compile summary information about the data to inform next steps

**Part 3:** Understand the variables

* Use insights from your examination of the summary data to guide deeper investigation into variables

<br/>

To complete the activity, follow the instructions and answer the questions below. Then, you will use your responses to these questions and the questions included in the Course 2 PACE Strategy Document to create an executive summary.

Be sure to complete this activity before moving on to Course 3. You can assess your work by comparing the results to a completed exemplar after completing the end-of-course project.

# **Identify data types and compile summary information**


Throughout these project notebooks, you'll see references to the problem-solving framework PACE. The following notebook components are labeled with the respective PACE stage: Plan, Analyze, Construct, and Execute.

# **PACE stages**

<img src="images/Pace.png" width="100" height="100" align=left>

   *        [Plan](#scrollTo=psz51YkZVwtN&line=3&uniqifier=1)
   *        [Analyze](#scrollTo=mA7Mz_SnI8km&line=4&uniqifier=1)
   *        [Construct](#scrollTo=Lca9c8XON8lc&line=2&uniqifier=1)
   *        [Execute](#scrollTo=401PgchTPr4E&line=2&uniqifier=1)

<img src="images/Plan.png" width="100" height="100" align=left>


## **PACE: Plan**

Consider the questions in your PACE Strategy Document and those below to craft your response:



### **Task 1. Understand the situation**

*   How can you best prepare to understand and organize the provided information?


*Begin by exploring your dataset and consider reviewing the Data Dictionary.*

Read the Data Dictionary entry for each variable in the dataset. Check that each variable has the correct type and format; look for null values, duplicates, outliers, and strange or nonsensical values; and read the transcription text for key words that can help find patterns for tagging a text as an opinion or a claim.
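
The checks listed above can be sketched on a small, made-up frame. This is only an illustration: the column names mirror the dataset, but the rows in `toy` are invented and are not taken from `tiktok_dataset.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration only; not taken from tiktok_dataset.csv
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None, "claim"],
    "video_duration_sec": [59, 32, 16, 59],
    "video_view_count": [343296.0, 4953.0, np.nan, 343296.0],
})

null_counts = toy.isna().sum()            # missing values per column
duplicate_rows = toy.duplicated().sum()   # rows that fully repeat an earlier row
negative_views = (toy["video_view_count"] < 0).sum()  # nonsensical values

print(null_counts)
print("duplicates:", duplicate_rows)
print("negative view counts:", negative_views)
```

The same three calls should work unchanged on the full dataframe once it is loaded.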

<img src="images/Analyze.png" width="100" height="100" align=left>

## **PACE: Analyze**

Consider the questions in your PACE Strategy Document to reflect on the Analyze stage.

### **Task 2a. Imports and data loading**

Start by importing the packages that you will need to load and explore the dataset. Make sure to use the following import statements:
*   `import pandas as pd`

*   `import numpy as np`


In [8]:
# Import packages
### YOUR CODE HERE ###
import numpy as np
import pandas as pd

Then, load the dataset into a dataframe. Creating a dataframe will help you conduct data manipulation, exploratory data analysis (EDA), and statistical activities.

**Note:** As shown in this cell, the dataset has been automatically loaded in for you. You do not need to download the .csv file, or provide more code, in order to access the dataset and proceed with this lab. Please continue with this activity by completing the following instructions.

In [9]:
# Load dataset into dataframe
data = pd.read_csv("tiktok_dataset.csv")

### **Task 2b. Understand the data - Inspect the data**

View and inspect summary information about the dataframe by **coding the following:**

1. `data.head(10)`
2. `data.info()`
3. `data.describe()`

*Consider the following questions:*

**Question 1:** When reviewing the first few rows of the dataframe, what do you observe about the data? What does each row represent?

**Question 2:** When reviewing the `data.info()` output, what do you notice about the different variables? Are there any null values? Are all of the variables numeric? Does anything else stand out?

**Question 3:** When reviewing the `data.describe()` output, what do you notice about the distributions of each variable? Are there any questionable values? Does it seem that there are outlier values?

















In [10]:
# Display and examine the first ten rows of the dataframe
### YOUR CODE HERE ###
data.head(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0
5,6,claim,8972200955,35,someone shared with me that gross domestic pro...,not verified,under review,336647.0,175546.0,62303.0,4293.0,1857.0
6,7,claim,4958886992,16,someone shared with me that elvis presley has ...,not verified,active,750345.0,486192.0,193911.0,8616.0,5446.0
7,8,claim,2270982263,41,someone shared with me that the best selling s...,not verified,active,547532.0,1072.0,50.0,22.0,11.0
8,9,claim,5235769692,50,someone shared with me that about half of the ...,not verified,active,24819.0,10160.0,1050.0,53.0,27.0
9,10,claim,4660861094,45,someone shared with me that it would take a 50...,verified,active,931587.0,171051.0,67739.0,4104.0,2540.0


In [5]:
data.tail(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
19372,19373,,5731766527,16,,not verified,active,,,,,
19373,19374,,5731838072,46,,verified,active,,,,,
19374,19375,,3559825127,42,,not verified,active,,,,,
19375,19376,,2159797367,45,,verified,active,,,,,
19376,19377,,4099538565,7,,not verified,active,,,,,
19377,19378,,7578226840,21,,not verified,active,,,,,
19378,19379,,6079236179,53,,not verified,active,,,,,
19379,19380,,2565539685,10,,verified,under review,,,,,
19380,19381,,2969178540,24,,not verified,active,,,,,
19381,19382,,8132759688,13,,not verified,active,,,,,


In [6]:
data.sample(10)

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
1840,1841,claim,1914077833,15,a friend learned on social media that koalas ...,not verified,active,788915.0,348649.0,130702.0,1382.0,321.0
7068,7069,claim,9413769856,17,a colleague read in a discussion board a clai...,not verified,under review,32082.0,12757.0,4366.0,204.0,33.0
3792,3793,claim,1965190722,17,someone read on an internet forum that the au...,not verified,banned,744428.0,357507.0,67950.0,128.0,51.0
18829,18830,opinion,4318055640,52,my colleagues are convinced that the moon was...,not verified,banned,72.0,14.0,4.0,0.0,0.0
6342,6343,claim,8108606635,52,a colleague learned in a discussion board a c...,not verified,active,452382.0,249413.0,36361.0,3817.0,2216.0
2165,2166,claim,5904191430,55,a friend learned from the news a claim that p...,not verified,active,233079.0,47915.0,5492.0,385.0,65.0
6181,6182,claim,9220962386,7,a colleague learned on social media that most...,not verified,active,354386.0,33358.0,5180.0,748.0,84.0
207,208,claim,2892129815,42,a friend told me someone discovered that at bi...,not verified,active,934185.0,22021.0,8068.0,538.0,134.0
19024,19025,opinion,3896133084,33,in our opinion the hottest planet in our solar...,not verified,active,4215.0,1140.0,50.0,3.0,1.0
15076,15077,opinion,7704837859,9,my family's understanding is that insects are ...,not verified,active,725.0,140.0,10.0,3.0,1.0


In [7]:
# Get summary info
### YOUR CODE HERE ###
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19382 entries, 0 to 19381
Data columns (total 12 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   #                         19382 non-null  int64  
 1   claim_status              19084 non-null  object 
 2   video_id                  19382 non-null  int64  
 3   video_duration_sec        19382 non-null  int64  
 4   video_transcription_text  19084 non-null  object 
 5   verified_status           19382 non-null  object 
 6   author_ban_status         19382 non-null  object 
 7   video_view_count          19084 non-null  float64
 8   video_like_count          19084 non-null  float64
 9   video_share_count         19084 non-null  float64
 10  video_download_count      19084 non-null  float64
 11  video_comment_count       19084 non-null  float64
dtypes: float64(5), int64(3), object(4)
memory usage: 1.8+ MB


In [8]:
# Get summary statistics
### YOUR CODE HERE ###
data.describe()

Unnamed: 0,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382.0,19382.0,19382.0,19084.0,19084.0,19084.0,19084.0,19084.0
mean,9691.5,5627454000.0,32.421732,254708.558688,84304.63603,16735.248323,1049.429627,349.312146
std,5595.245794,2536440000.0,16.229967,322893.280814,133420.546814,32036.17435,2004.299894,799.638865
min,1.0,1234959000.0,5.0,20.0,0.0,0.0,0.0,0.0
25%,4846.25,3430417000.0,18.0,4942.5,810.75,115.0,7.0,1.0
50%,9691.5,5618664000.0,32.0,9954.5,3403.5,717.0,46.0,9.0
75%,14536.75,7843960000.0,47.0,504327.0,125020.0,18222.0,1156.25,292.0
max,19382.0,9999873000.0,60.0,999817.0,657830.0,256130.0,14994.0,9599.0


In [32]:
# Get summary statistics
### YOUR CODE HERE ###
data.describe().astype(int).drop(columns=['#', 'video_id'])

#applymap(lambda x: f"{x:0.1f}").drop(columns=['#', 'video_id'])
#.drop(columns=['#', 'video_id', 'video_duration_sec', 'video_view_count', 'video_like_count'])

Unnamed: 0,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
count,19382,19084,19084,19084,19084,19084
mean,32,254708,84304,16735,1049,349
std,16,322893,133420,32036,2004,799
min,5,20,0,0,0,0
25%,18,4942,810,115,7,1
50%,32,9954,3403,717,46,9
75%,47,504327,125020,18222,1156,292
max,60,999817,657830,256130,14994,9599


1. The last five columns (the counts of views, likes, shares, downloads and comments) are stored as floats, although they should be integers, since a count cannot have a non-zero decimal part. They are most likely stored as `float64` because these columns contain missing values.

1.1. Using the `head(10)`, `tail(10)` and `sample(10)` methods gave me the gist of the dataframe. There are several null cells.

2. There are 19,382 entries and 12 columns: five are float, three are integer and four are string (`object`) type. The variables that don't contain null values are `#`, `video_id`, `video_duration_sec`, `verified_status` and `author_ban_status`. Where there is no transcription text there is, reasonably, no claim status; more surprisingly, those rows also appear to have no counts of views, shares, likes, downloads or comments. Each of the seven affected columns has 298 null values. The transcription texts start in different ways, and the table seems to be sorted by how the text starts: in the first 10 rows, for example, every transcription begins with "someone shared with me that". No other variable shows a clear ordering pattern.

3. Except for video duration, the numeric variables have a standard deviation greater than their mean, and their medians sit far below their means, so the values are widely spread and right-skewed. That could also be a sign of outliers, although it is not yet clear whether there are any. So far, the data has no nonsensical values such as a negative number of downloads or likes. However, the minimum video duration is 5 seconds, which is very short; it does not seem easy to make a claim or state an opinion in five seconds. It will be important to examine the videos that last only a few seconds. All the video IDs have 10 digits (the minimum and maximum both do), which suggests a consistent ID format, though it will still be necessary to check for duplicates.
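
Two of the observations above can be checked directly: whether the missing values fall in the same rows, and whether the count columns can be stored as integers despite the gaps. The sketch below uses invented rows and assumes a pandas version that supports the nullable `Int64` dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the last one mimics a record with no transcription
toy = pd.DataFrame({
    "claim_status": ["claim", "opinion", None],
    "video_view_count": [343296.0, 4953.0, np.nan],
    "video_like_count": [19425.0, 820.0, np.nan],
})

# If "all missing" and "any missing" flag the same rows, the gaps co-occur
cols = ["claim_status", "video_view_count", "video_like_count"]
all_missing = toy[cols].isna().all(axis=1)
any_missing = toy[cols].isna().any(axis=1)
same_rows = bool((all_missing == any_missing).all())

# The nullable integer dtype keeps missing values while storing whole numbers
toy["video_view_count"] = toy["video_view_count"].astype("Int64")
print(same_rows, toy["video_view_count"].dtype)
```

Whether to convert the real columns is a judgment call; leaving them as floats does not affect the summary statistics.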

### **Task 2c. Understand the data - Investigate the variables**

In this phase, you will begin to investigate the variables more closely to better understand them.

You know from the project proposal that the ultimate objective is to use machine learning to classify videos as either claims or opinions. A good first step towards understanding the data might therefore be examining the `claim_status` variable. Begin by determining how many videos there are for each different claim status.

In [9]:
data.duplicated().value_counts()

False    19382
dtype: int64

In [10]:
# What are the different values for claim status and how many of each are in the data?
### YOUR CODE HERE ###
data['claim_status'].value_counts()

claim      9608
opinion    9476
Name: claim_status, dtype: int64

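In [4]:
# Claims as a share of the 19,084 rows that have a claim status
9608/19084

0.5034583944665688

In [6]:
# Opinions as a share of all 19,382 rows, nulls included
9476/19382

0.4889072335156331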

In [14]:
data['verified_status'].value_counts()

not verified    18142
verified         1240
Name: verified_status, dtype: int64

In [15]:
1240/19382

0.06397688577030235

In [38]:
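# Drop identifiers and free text so that only numeric columns are aggregated
verified_agg = (data.drop(columns=['#', 'video_id', 'video_transcription_text'])).groupby(['claim_status', 'verified_status', 'author_ban_status']).agg(['count', 'mean', 'median']).astype(int)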
verified_agg

,,,video_duration_sec,video_duration_sec,video_duration_sec,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count,video_download_count,video_download_count,video_download_count,video_comment_count,video_comment_count,video_comment_count
claim_status,verified_status,author_ban_status,count,mean,median,count,mean,median,count,mean,median,count,mean,median,count,mean,median,count,mean,median
claim,not verified,active,6407,32,33,6407,499548,500132,6407,165113,121741,6407,32739,17701,6407,2055,1130,6407,688,286
claim,not verified,banned,1417,32,32,1417,504883,511505,1417,173214,131997,1417,33897,18662,1417,2130,1191,1417,691,295
claim,not verified,under review,1575,32,32,1575,503509,500086,1575,164986,124491,1575,32905,18055,1575,2089,1148,1575,693,270
claim,verified,active,159,33,33,159,486058,494958,159,158775,129316,159,33969,21743,159,1765,1063,159,674,313
claim,verified,banned,22,27,25,22,571861,643261,22,206225,177964,22,44330,30234,22,2861,1503,22,1128,432
claim,verified,under review,28,31,32,28,534696,628683,28,198197,168184,28,47201,27663,28,2609,1455,28,971,339
opinion,not verified,active,7898,32,33,7898,4960,4949,7898,1093,816,7898,217,122,7898,13,7,7898,2,1
opinion,not verified,banned,173,32,32,173,4867,5147,173,1038,803,173,207,99,173,13,6,173,2,1
opinion,not verified,under review,414,30,28,414,4957,4900,414,1131,875,414,216,120,414,14,8,414,2,1
opinion,verified,active,919,31,31,919,4933,4985,919,1079,842,919,211,114,919,13,7,919,2,1


In [41]:
(data.drop(columns=['#', 'video_id', 'video_transcription_text', 'author_ban_status'])).groupby(['verified_status','claim_status']).agg(['count', 'mean', 'median']).astype(int)

,,video_duration_sec,video_duration_sec,video_duration_sec,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count,video_download_count,video_download_count,video_download_count,video_comment_count,video_comment_count,video_comment_count
verified_status,claim_status,count,mean,median,count,mean,median,count,mean,median,count,mean,median,count,mean,median,count,mean,median
not verified,claim,9399,32,32,9399,501016,501453,9399,166313,123480,9399,32941,17952,9399,2072,1141,9399,689,285
not verified,opinion,8485,32,33,8485,4958,4950,8485,1093,820,8485,217,122,8485,13,7,8485,2,1
verified,claim,209,32,32,209,501606,502375,209,169051,132633,209,36832,23294,209,1993,1099,209,762,324
verified,opinion,991,31,31,991,4935,4958,991,1083,842,991,213,119,991,13,7,991,2,1


**Question:** What do you notice about the values shown?

4. Although there are slightly more claims than opinions, each class makes up about half of the data, so the dataset is well balanced.
\
Verified accounts are only about 6 % of the dataset, so `verified_status` may be of limited use as a variable in this analysis.
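
As a cross-check on the class balance, the sketch below rebuilds only the label column from the counts reported above (9,608 claims, 9,476 opinions, and 298 missing labels out of 19,382 rows) and lets `value_counts(normalize=True)` compute the shares. The series `labels` is a stand-in built for illustration, not the real column.

```python
import pandas as pd

# Stand-in for data['claim_status'], rebuilt from the counts reported above
labels = pd.Series(["claim"] * 9608 + ["opinion"] * 9476 + [None] * 298)

# Shares of all rows, with missing labels kept as their own category
share_all_rows = labels.value_counts(normalize=True, dropna=False)
# Shares of labelled rows only (missing labels are excluded by default)
share_labelled = labels.value_counts(normalize=True)

print(share_all_rows.round(3))
print(share_labelled.round(3))
```

On the real dataframe the same two calls can be applied to `data['claim_status']`, which avoids typing the counts by hand.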

Next, examine the engagement trends associated with each different claim status.

Start by using Boolean masking to filter the data according to claim status, then calculate the mean and median view counts for each claim status.

In [44]:
# What is the average view count of videos with "claim" status?
### YOUR CODE HERE ###
mask = (data['claim_status'] == 'claim')
claim_view = data[mask]
mean_view_claim = data[mask]['video_view_count'].mean()
print(mean_view_claim)
median_view_claim = data[mask]['video_view_count'].median()
print(median_view_claim)
#mode_view_claim = (data[mask]['video_view_count']).mode()
#print(mode_view_claim)

501029.4527477102
501555.0


In [45]:
# What is the average and median view count of videos with "opinion" status?
### YOUR CODE HERE ###
mask = (data['claim_status'] == 'opinion')
opinion_view = data[mask]
mean_view_opinion = data[mask]['video_view_count'].mean()
print(mean_view_opinion)
median_view_opinion =  data[mask]['video_view_count'].median()
print(median_view_opinion)

4956.43224989447
4953.0


**Question:** What do you notice about the mean and median within each claim category?

5. Videos with 'claim' status get roughly a hundred times more views than 'opinion' videos (a mean of about 501,000 views versus about 5,000). Within each category the mean and median are very close to each other, which suggests that the view counts are distributed roughly symmetrically.
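
The same comparison can be made in one step with `groupby()` instead of two Boolean masks. This is a sketch on invented view counts, chosen only to illustrate the call; a mean-to-median ratio near 1 is a hint of symmetry, not proof.

```python
import pandas as pd

# Hypothetical view counts for illustration only
toy = pd.DataFrame({
    "claim_status": ["claim", "claim", "claim", "opinion", "opinion", "opinion"],
    "video_view_count": [400000.0, 500000.0, 600000.0, 4000.0, 5000.0, 6000.0],
})

# Mean and median view count for each claim status
view_summary = toy.groupby("claim_status")["video_view_count"].agg(["mean", "median"])
# A ratio close to 1 suggests a roughly symmetric distribution
view_summary["mean_to_median"] = view_summary["mean"] / view_summary["median"]
print(view_summary)
```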

Now, examine trends associated with the ban status of the author.

Use `groupby()` to calculate how many videos there are for each combination of categories of claim status and author ban status.

In [26]:
# Get counts for each group combination of claim status and author ban status
### YOUR CODE HERE ###
ban_status_trend = data.groupby(['claim_status', 'author_ban_status']).agg('count')[['#']]
ban_status_trend

claim_status,author_ban_status,#
claim,active,6566
claim,banned,1439
claim,under review,1603
opinion,active,8817
opinion,banned,196
opinion,under review,463


In [None]:
#ban_status_trend['author_claim'] = ban_status_trend['#']/(sum(ban_status_trend.iloc[:3]))

In [54]:
banned_claim = (1439/(6566+1439+1603))
banned_claim

0.14977102414654456

In [55]:
banned_opinion = (196/(8817+196+463))
banned_opinion

0.020683832840861123

In [11]:
data['author_ban_status'].value_counts()

active          15663
under review     2080
banned           1639
Name: author_ban_status, dtype: int64

In [13]:
1639/19382

0.08456299659477866

**Question:** What do you notice about the number of claims videos with banned authors? Why might this relationship occur?

6. About 15 % of 'claim' videos come from banned authors, and about 32 % come from authors who are either banned or under review. For 'opinion' videos the figures are about 2 % banned and about 7 % banned or under review. This suggests that TikTok penalises 'claim' videos much more heavily, which is consistent with intuition: claims may be taken as true by users who do not verify them, and so can be dangerous or misleading.
\
Overall, banned authors represent only about 8 % of the dataset and authors under review about 11 %.
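
The percentages quoted above can be reproduced from the group counts by dividing each row by its total. The sketch below copies the six counts from the table above into a small frame; the name `counts` is introduced here for illustration.

```python
import pandas as pd

# Counts copied from the claim status / author ban status table above
counts = pd.DataFrame(
    {"active": [6566, 8817], "banned": [1439, 196], "under review": [1603, 463]},
    index=["claim", "opinion"],
)

# Divide each row by its total to get shares within each claim status
row_share = counts.div(counts.sum(axis=1), axis=0)
print((row_share * 100).round(1))
```

On the full dataframe, `pd.crosstab(data['claim_status'], data['author_ban_status'], normalize='index')` should give the same shares without copying numbers by hand.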

Continue investigating engagement levels, now focusing on `author_ban_status`.

Calculate the median video share count of each author ban status.

In [41]:
### YOUR CODE HERE ###
ban_status_trend = data.groupby(['claim_status', 'author_ban_status']).median(numeric_only=True)
ban_status_trend

claim_status,author_ban_status,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
claim,active,4843.5,5625380000.0,33.0,499500.5,121943.0,17774.5,1124.5,286.0
claim,banned,4695.0,5534796000.0,32.0,512572.0,132044.0,19018.0,1204.0,296.0
claim,under review,4755.0,5635166000.0,32.0,500774.0,125882.0,18084.0,1153.0,279.0
opinion,active,14349.0,5606485000.0,33.0,4958.0,820.0,121.0,7.0,1.0
opinion,banned,14761.0,6258346000.0,32.0,5083.5,799.5,108.5,6.0,1.0
opinion,under review,14176.0,5417117000.0,28.0,4884.0,876.0,124.0,8.0,1.0


In [42]:
# What's the median video share count of each author ban status?
### YOUR CODE HERE ###
median_share_author = data.groupby('author_ban_status').median(numeric_only=True)
median_share_author

author_ban_status,#,video_id,video_duration_sec,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count
active,10966.0,5624036000.0,33.0,8616.0,2222.0,437.0,28.0,5.0
banned,5304.0,5563176000.0,32.0,448201.0,105573.0,14468.0,892.0,209.0
under review,6175.5,5607722000.0,31.0,365245.5,71204.5,9444.0,610.5,136.5


**Question:** What do you notice about the share count of banned authors, compared to that of active authors? Explore this in more depth.

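7. The median share count for banned authors (14,468) is about 33 times that of active authors (437). Widely shared videos seem to be reviewed more often and to be judged inappropriate more often than videos with few shares, which could indicate that TikTok is good at detecting misleading or harmful content. However, within each claim status the median share counts are similar across ban statuses (for claims, 17,774.5 for active authors versus 19,018 for banned ones), so much of the gap may simply reflect that banned authors post mostly claim videos.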
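
The "33 times" figure is a ratio of group medians. The sketch below copies the three median share counts from the `author_ban_status` table above and expresses each as a multiple of the active-author median; the variable names are introduced here for illustration.

```python
import pandas as pd

# Median share counts copied from the author_ban_status table above
median_shares = pd.Series(
    {"active": 437.0, "banned": 14468.0, "under review": 9444.0},
    name="video_share_count",
)

# Express each group's median as a multiple of the active-author median
ratio_to_active = median_shares / median_shares["active"]
print(ratio_to_active.round(1))
```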

Use `groupby()` to group the data by `author_ban_status`, then use `agg()` to get the count, mean, and median of each of the following columns:
* `video_view_count`
* `video_like_count`
* `video_share_count`

Remember, the argument for the `agg()` function is a dictionary whose keys are columns. The values for each column are a list of the calculations you want to perform.
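
A minimal sketch of the dictionary form described above, run on invented rows so that the result is easy to verify by hand. The frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    "author_ban_status": ["active", "active", "banned", "banned"],
    "video_view_count": [100.0, 300.0, 1000.0, 3000.0],
    "video_like_count": [10.0, 30.0, 400.0, 600.0],
    "video_share_count": [1.0, 3.0, 50.0, 150.0],
})

# Keys are column names; values list the calculations to run on each column
stats = toy.groupby("author_ban_status").agg({
    "video_view_count": ["count", "mean", "median"],
    "video_like_count": ["count", "mean", "median"],
    "video_share_count": ["count", "mean", "median"],
})
print(stats)
```

The result has two-level column labels, so a single value is read with a tuple, for example `stats.loc["banned", ("video_view_count", "mean")]`.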

In [47]:
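### YOUR CODE HERE ###
# Aggregate only the five numeric count columns
count_cols = ['video_view_count', 'video_like_count', 'video_share_count', 'video_download_count', 'video_comment_count']
author_stats = data.groupby('author_ban_status')[count_cols].agg(['count', 'mean', 'median'])
author_stats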

,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count,video_download_count,video_download_count,video_download_count,video_comment_count,video_comment_count,video_comment_count
author_ban_status,count,mean,median,count,mean,median,count,mean,median,count,mean,median,count,mean,median
active,15383,215927.039524,8616.0,15383,71036.533836,2222.0,15383,14111.466164,437.0,15383,882.276344,28.0,15383,295.134499,5.0
banned,1635,445845.439144,448201.0,1635,153017.236697,105573.0,1635,29998.942508,14468.0,1635,1886.296024,892.0,1635,614.956575,209.0
under review,2066,392204.836399,365245.5,2066,128718.050339,71204.5,2066,25774.696999,9444.0,2066,1631.734753,610.5,2066,542.480639,136.5


In [48]:
author_stats[['video_view_count', 'video_like_count', 'video_share_count']]

,video_view_count,video_view_count,video_view_count,video_like_count,video_like_count,video_like_count,video_share_count,video_share_count,video_share_count
author_ban_status,count,mean,median,count,mean,median,count,mean,median
active,15383,215927.039524,8616.0,15383,71036.533836,2222.0,15383,14111.466164,437.0
banned,1635,445845.439144,448201.0,1635,153017.236697,105573.0,1635,29998.942508,14468.0
under review,2066,392204.836399,365245.5,2066,128718.050339,71204.5,2066,25774.696999,9444.0


**Question:** What do you notice about the number of views, likes, and shares for banned authors compared to active authors?

Now, create three new columns to help better understand engagement rates:
* `likes_per_view`: represents the number of likes divided by the number of views for each video
* `comments_per_view`: represents the number of comments divided by the number of views for each video
* `shares_per_view`: represents the number of shares divided by the number of views for each video

In [43]:
# Create a likes_per_view column
### YOUR CODE HERE ###
data['likes_per_view'] = data['video_like_count'] / data['video_view_count']
# Create a comments_per_view column
### YOUR CODE HERE ###
data['comments_per_view'] = data['video_comment_count'] / data['video_view_count']
# Create a shares_per_view column
### YOUR CODE HERE ###
data['shares_per_view'] = data['video_share_count'] / data['video_view_count']

In [57]:
data

Unnamed: 0,#,claim_status,video_id,video_duration_sec,video_transcription_text,verified_status,author_ban_status,video_view_count,video_like_count,video_share_count,video_download_count,video_comment_count,likes_per_view,comments_per_view,shares_per_view
0,1,claim,7017666017,59,someone shared with me that drone deliveries a...,not verified,under review,343296.0,19425.0,241.0,1.0,0.0,0.056584,0.000000,0.000702
1,2,claim,4014381136,32,someone shared with me that there are more mic...,not verified,active,140877.0,77355.0,19034.0,1161.0,684.0,0.549096,0.004855,0.135111
2,3,claim,9859838091,31,someone shared with me that american industria...,not verified,active,902185.0,97690.0,2858.0,833.0,329.0,0.108282,0.000365,0.003168
3,4,claim,1866847991,25,someone shared with me that the metro of st. p...,not verified,active,437506.0,239954.0,34812.0,1234.0,584.0,0.548459,0.001335,0.079569
4,5,claim,7105231098,19,someone shared with me that the number of busi...,not verified,active,56167.0,34987.0,4110.0,547.0,152.0,0.622910,0.002706,0.073175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19377,19378,,7578226840,21,,not verified,active,,,,,,,,
19378,19379,,6079236179,53,,not verified,active,,,,,,,,
19379,19380,,2565539685,10,,verified,under review,,,,,,,,
19380,19381,,2969178540,24,,not verified,active,,,,,,,,


Use `groupby()` to compile the information in each of the three newly created columns for each combination of categories of claim status and author ban status, then use `agg()` to calculate the count, the mean, and the median of each group.

In [60]:
### YOUR CODE HERE ###
per_view_cols = ['likes_per_view', 'comments_per_view', 'shares_per_view']
per_view_stats = data.groupby(['claim_status', 'author_ban_status'])[per_view_cols].agg(['count', 'mean', 'median'])
per_view_stats

,,likes_per_view,likes_per_view,likes_per_view,comments_per_view,comments_per_view,comments_per_view,shares_per_view,shares_per_view,shares_per_view
claim_status,author_ban_status,count,mean,median,count,mean,median,count,mean,median
claim,active,6566,32.954203,32.653813,6566,0.139263,0.077551,6566,6.545579,4.927911
claim,banned,1439,34.507055,35.890867,1439,0.137719,0.074562,1439,6.789342,5.160571
claim,under review,1603,32.799666,32.086742,1603,0.136709,0.078887,1603,6.573347,4.99666
opinion,active,8817,21.974352,21.833031,8817,0.051708,0.025176,8817,4.372857,3.240497
opinion,banned,196,20.68684,19.848304,196,0.043436,0.019277,196,4.053072,3.072761
opinion,under review,463,22.639437,22.80508,463,0.053556,0.029261,463,4.447242,3.502682


**Question:**

How does the data for claim videos and opinion videos compare or differ? Consider views, comments, likes, and shares.

9. Users interact more with 'claim' videos than with 'opinion' videos.

* 'claim' status: about 3 in 10 views lead to a like, about 1 in 1,000 to a comment, and about 7 in 100 to a share.
* 'opinion' status: about 2 in 10 views lead to a like, about 5 in 10,000 to a comment, and about 4 in 100 to a share.

Within each claim status, user interaction is essentially the same for active, banned and under review authors. For likes per view the mean and median are also very close to each other, which suggests a roughly symmetric distribution; for comments and shares per view the mean is noticeably higher than the median.
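
The "out of N views" figures can be derived from the table by simple arithmetic. The sketch below assumes that the per-view values in the table above are percentages (the ratio multiplied by 100), which is how the answer reads them; the helper `per_thousand_views` is hypothetical and only converts a percentage into interactions per 1,000 views.

```python
# Mean per-view rates for active authors, copied from the table above.
# Assumption: the table reports percentages (ratio x 100).
claim_pct = {"likes": 32.954203, "comments": 0.139263, "shares": 6.545579}
opinion_pct = {"likes": 21.974352, "comments": 0.051708, "shares": 4.372857}

def per_thousand_views(pct):
    """Convert a percentage of views into interactions per 1,000 views."""
    return round(pct * 10, 1)

for name in claim_pct:
    print(name, per_thousand_views(claim_pct[name]), per_thousand_views(opinion_pct[name]))
```

Under that assumption, a claim video from an active author draws roughly 330 likes, 1.4 comments and 65 shares per 1,000 views, against roughly 220, 0.5 and 44 for an opinion video.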

<img src="images/Construct.png" width="100" height="100" align=left>

## **PACE: Construct**

**Note**: The Construct stage does not apply to this workflow. The PACE framework can be adapted to fit the specific requirements of any project.




<img src="images/Execute.png" width="100" height="100" align=left>

## **PACE: Execute**

Consider the questions in your PACE Strategy Document and those below to craft your response.

### **Given your efforts, what can you summarize for Rosie Mae Bradshaw and the TikTok data team?**

*Note for Learners: Your answer should address TikTok's request for a summary that covers the following points:*

*   What percentage of the data is comprised of claims and what percentage is comprised of opinions?
*   What factors correlate with a video's claim status?
*   What factors correlate with a video's engagement level?


10. Even when rows with null values are counted, claims and opinions each make up about half of the data, and rows with null values are less than 2 % of the total, so the dataset is well balanced. The more interactions a video has (likes, comments, shares, downloads), the more likely it is to be a claim rather than an opinion, and the more likely its author is to be banned or under review. Within each claim status, however, the interactions per view barely differ by author ban status.

**Congratulations!** You've completed this lab. However, you may not notice a green check mark next to this item on Coursera's platform. Please continue your progress regardless of the check mark. Just click on the "save" icon at the top of this notebook to ensure your work has been logged.