# **Testing My Hypothesis**

Now, we know that the sample data tracks reactions per piece of content, but can we actually determine the number of posts uploaded each month? Let's explore.

At first, it seemed straightforward to determine monthly post volume by counting unique 'Content IDs' each month. January initially appeared to have the highest volume, which fit with users likely reconnecting after the holidays. However, the values on the Tableau heatmap didn’t add up as expected.

Closer examination showed only 962 unique content IDs in total, yet the heatmap averaged 733 unique IDs per month. Summed over twelve months, that is roughly 733 × 12 = 8,796, far more than the dataset's 962 unique IDs, which is a clear indication of a discrepancy. This led to a deeper investigation here using Python, with two main hypotheses to explore.
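
The mismatch is easy to reproduce on a small made-up table. The sketch below uses hypothetical data (only the column names mirror the real dataset) to show that when the same ID turns up in several months, the sum of the monthly unique counts exceeds the overall unique count.

```python
import pandas as pd

# Hypothetical toy data: three pieces of content, reacted to across three months
toy = pd.DataFrame({
    "Content ID": ["a", "a", "b", "a", "b", "c"],
    "Datetime": pd.to_datetime([
        "2021-01-05", "2021-01-20", "2021-01-22",
        "2021-02-03", "2021-02-14", "2021-03-01",
    ]),
})

# Distinct IDs seen in each calendar month
monthly_unique = toy.groupby(toy["Datetime"].dt.to_period("M"))["Content ID"].nunique()

# Distinct IDs in the whole table
overall_unique = toy["Content ID"].nunique()

print(monthly_unique)
print(f"Sum of monthly uniques | {monthly_unique.sum()}")
print(f"Overall uniques        | {overall_unique}")
```

Here the monthly counts are 2, 2 and 1, which sum to 5 even though only 3 distinct IDs exist.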

<br>

### **Import Datasets**

In [56]:
import pandas as pd

In [81]:
rdf = pd.read_csv("Reactions.csv")
cdf = pd.read_csv("Content.csv")
rtdf = pd.read_csv("ReactionTypes.csv")

<br>

### **Apply Original Cleaning and Merge Data**

***But*** do not remove blank rows from the `"Reaction Type"` column.

In [82]:
# Rename "Type" column to "Reaction Type"
rdf = rdf.rename(columns={"Type": "Reaction Type"})
rdf.head(1)

Unnamed: 0.1,Unnamed: 0,Content ID,User ID,Reaction Type,Datetime
0,0,97522e57-d9ab-4bd6-97bf-c24d952602d2,,,2021-04-22 15:17:15


In [83]:
# Rename "Type" column to "Content Type"
cdf = cdf.rename(columns={"Type": "Content Type"})
cdf.head(1)

Unnamed: 0.1,Unnamed: 0,Content ID,User ID,Content Type,Category,URL
0,0,97522e57-d9ab-4bd6-97bf-c24d952602d2,8d3cd87d-8a31-4935-9a4f-b319bfe05f31,photo,Studying,https://socialbuzz.cdn.com/content/storage/975...


In [84]:
# Rename "Type" column to "Reaction Type"
rtdf = rtdf.rename(columns={"Type": "Reaction Type"})
rtdf.head(1)

Unnamed: 0.1,Unnamed: 0,Reaction Type,Sentiment,Score
0,0,heart,positive,60


In [85]:
# Convert to datetime using the format present in the file; values that do not parse become NaT
rdf['Datetime'] = pd.to_datetime(rdf['Datetime'], format='%Y-%m-%d %H:%M:%S', errors='coerce')

# Confirm changes
rdf.dtypes

Unnamed: 0                int64
Content ID               object
User ID                  object
Reaction Type            object
Datetime         datetime64[ns]
dtype: object
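
As a quick illustration of what `errors='coerce'` does in the conversion above, here is a minimal sketch on made-up strings (the values are hypothetical, not taken from `Reactions.csv`). Anything that does not match the format, or is missing, comes back as `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw strings: two valid timestamps, one malformed, one missing
raw = pd.Series(["2021-04-22 15:17:15", "2020-11-07 09:27:08", "not a date", None])

# Values that do not match the format are coerced to NaT rather than raising
parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S", errors="coerce")

print(parsed)
print(f"Unparseable or missing | {parsed.isna().sum()}")
```

Counting `NaT` values straight after the conversion is a cheap way to see how many rows would later be dropped for having no usable timestamp.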

In [86]:
# Drop "User ID" and "URL" columns
rdf = rdf.drop(columns=['User ID'])
cdf = cdf.drop(columns=['User ID'])
cdf = cdf.drop(columns=['URL'])

In [87]:
# Remove the quotation marks and convert all text to lowercase
cdf['Category'] = cdf['Category'].str.replace('"', '', regex=False)
cdf["Category"] = cdf["Category"].str.lower()
cdf['Category'].unique()

array(['studying', 'healthy eating', 'technology', 'food', 'cooking',
       'dogs', 'soccer', 'public speaking', 'science', 'tennis', 'travel',
       'fitness', 'education', 'veganism', 'animals', 'culture'],
      dtype=object)

<br>

### **Merge Datasets & Conduct Additional Cleaning**

In [88]:
# Merge with outer join to include all rows from both DataFrames
test_df = rdf.merge(cdf, on="Content ID", how="outer")
test_df = test_df.merge(rtdf, on="Reaction Type", how="outer")
test_df.head(2)

Unnamed: 0.1,Unnamed: 0_x,Content ID,Reaction Type,Datetime,Unnamed: 0_y,Content Type,Category,Unnamed: 0,Sentiment,Score
0,15950.0,00d0cdf9-5919-4102-bf84-ebde253c3cd2,adore,2020-09-25 19:56:00,617,audio,healthy eating,9.0,positive,72.0
1,15956.0,00d0cdf9-5919-4102-bf84-ebde253c3cd2,adore,2020-07-20 08:41:41,617,audio,healthy eating,9.0,positive,72.0
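
Because an outer join keeps rows that exist in only one table, it can introduce blanks of its own. A minimal sketch, assuming two hypothetical miniature tables, of how `indicator=True` can be used to tell which nulls were already in the data and which were created by the merge:

```python
import pandas as pd

# Hypothetical miniature versions of the reactions and content tables
reactions = pd.DataFrame({
    "Content ID": ["a", "a", "b"],
    "Reaction Type": ["heart", None, "hate"],
})
content = pd.DataFrame({
    "Content ID": ["a", "b", "c"],
    "Category": ["food", "dogs", "travel"],
})

# indicator=True adds a "_merge" column naming the source of each row
merged = reactions.merge(content, on="Content ID", how="outer", indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```

In this toy case content `c` has no reactions at all, so it appears once as `right_only` with an empty `"Reaction Type"`, while the other blank was present in the reactions table from the start.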


In [89]:
# Remove all 3 "Unnamed" columns
test_df = test_df.drop(columns=["Unnamed: 0_x", "Unnamed: 0_y", "Unnamed: 0"])
test_df.head(2)

Unnamed: 0,Content ID,Reaction Type,Datetime,Content Type,Category,Sentiment,Score
0,00d0cdf9-5919-4102-bf84-ebde253c3cd2,adore,2020-09-25 19:56:00,audio,healthy eating,positive,72.0
1,00d0cdf9-5919-4102-bf84-ebde253c3cd2,adore,2020-07-20 08:41:41,audio,healthy eating,positive,72.0


In [90]:
# Inspect for null values
test_df.isnull().sum()

Content ID          0
Reaction Type    1000
Datetime           20
Content Type        0
Category            0
Sentiment        1000
Score            1000
dtype: int64

In [91]:
# Remove all rows with blank values in the "Datetime" column
test_df.dropna(subset=["Datetime"], inplace=True)

In [92]:
# Inspect null values again
print(f"Total Rows | {test_df.shape[0]}")
test_df.isnull().sum()

Total Rows | 25553


Content ID         0
Reaction Type    980
Datetime           0
Content Type       0
Category           0
Sentiment        980
Score            980
dtype: int64

<br>

# **Begin Test**

There are two phases to this test. Now that we have a clean dataset that still includes the 980 blank rows removed in the previous analysis, let's check whether these rows represent the original upload date of each piece of content. If we cannot confirm this hypothesis, we will then investigate the alternative: that a `"Content ID"` row never represents content being created and posted, but only a post being reacted to, and can therefore occur multiple times across different months.

### **Phase 1:**

***1***- Confirm that the number of unique `"Content ID"` values matches the number of rows with blanks in the `"Reaction Type"` column.

***2***- Retrieve and display the first chronological occurrence of each unique `"Content ID"` in the dataset, sorted by date and time.
* If this hypothesis is correct, the first occurrence of each unique `"Content ID"` should have a blank value in the `"Reaction Type"`, `"Sentiment"`, and `"Score"` columns.
* If not, we move on to the next phase.
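
The two steps above can be sketched on a small made-up table. One caveat worth knowing: `groupby(...).first()` skips nulls by default and returns the first non-null value in each column, so it can hide a blank first row. Sorting and then calling `drop_duplicates` keeps each earliest row intact. The data below is hypothetical and only illustrates the difference.

```python
import pandas as pd

# Hypothetical toy data: "a" starts with a blank reaction, "b" does not
toy = pd.DataFrame({
    "Content ID": ["a", "a", "b", "b"],
    "Reaction Type": [None, "heart", "hate", None],
    "Datetime": pd.to_datetime([
        "2020-06-18 08:00:00", "2020-07-01 12:00:00",
        "2020-06-19 09:30:00", "2020-08-02 10:15:00",
    ]),
})
toy_sorted = toy.sort_values("Datetime")

# Earliest row per Content ID, kept whole (blanks included)
first_rows = toy_sorted.drop_duplicates(subset="Content ID", keep="first")
blank_first = first_rows["Reaction Type"].isna()

# For comparison: first() returns the first non-null value per column
skipping = toy_sorted.groupby("Content ID")["Reaction Type"].first()

print(f"Blank first rows (drop_duplicates) | {blank_first.sum()} of {len(first_rows)}")
print(f"Blank first rows (groupby.first)   | {skipping.isna().sum()} of {len(skipping)}")
print(f"All first occurrences blank        | {blank_first.all()}")
```

On this toy table the row-preserving approach finds one blank first occurrence, while `groupby(...).first()` reports none.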

In [93]:
# Confirm the number of unique "Content ID" values
print(f"Number of Unique Content ID's | {test_df['Content ID'].nunique()}")

Number of Unique Content ID's | 980


In [94]:
# Sort the DataFrame by "Datetime" in ascending order
test_df_sorted = test_df.sort_values(by="Datetime")

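# Group by "Content ID" and get the first occurrence based on the sorted Datetime.
# Note: first() skips nulls by default, so each column holds its first non-null value
# for that ID, which is not necessarily the value in the ID's earliest row.
first_occurrences = test_df_sorted.groupby("Content ID").first().reset_index()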

# Sort first_occurrences by "Datetime" in ascending order
first_occurrences = first_occurrences.sort_values(by="Datetime").reset_index(drop=True)

# Display the sorted result
print(f"Number of First Occurrences | {first_occurrences['Content ID'].nunique()}")
first_occurrences.head(2)

Number of First Occurrences | 980


Unnamed: 0,Content ID,Reaction Type,Datetime,Content Type,Category,Sentiment,Score
0,fada6910-2cc5-4600-808c-1e6066f795a6,cherish,2020-06-18 07:59:17,GIF,technology,positive,70.0
1,a727ed7f-5684-4536-b543-8e8fc93f40b1,hate,2020-06-18 08:07:22,video,cooking,negative,5.0


In [95]:
# Calculate the number of null values in the first occurrences
first_occurrences.isnull().sum()

Content ID        0
Reaction Type    18
Datetime          0
Content Type      0
Category          0
Sentiment        18
Score            18
dtype: int64

In [96]:
# Drop null values and confirm changes
first_occurrences.dropna(inplace=True)
print(f"Number of First Occurrences | {first_occurrences['Content ID'].nunique()}")
first_occurrences.isnull().sum()

Number of First Occurrences | 962


Content ID       0
Reaction Type    0
Datetime         0
Content Type     0
Category         0
Sentiment        0
Score            0
dtype: int64

In [97]:
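# Check if the "Reaction Type" in the first occurrence is blank.
# Nulls were dropped from first_occurrences in the previous cell, so this can only
# return False here; the 18 blank first occurrences were counted in cell 95.
has_blank_reaction = first_occurrences["Reaction Type"].isnull()

# Determine if all first occurrences have a blank "Reaction Type"
all_first_blank = has_blank_reaction.all()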

# Output the result
print(f'First occurrence of each "Content ID" has a blank "Reaction Type": {all_first_blank}')

First occurrence of each "Content ID" has a blank "Reaction Type": False


***We `Reject` this hypothesis, as 962 of the 980 first occurrences of each unique "Content ID" `HAVE` a value in the "Reaction Type" column. This means the `blank rows DO NOT represent the upload dates` of each piece of content.***

### **Phase 2**

***1***- Drop all blank rows and proceed with testing on the dataset equivalent to the previous analysis.

***2***- Take a random unique `"Content ID"` and see how many times it appears, and across how many months.

***3***- Retrieve the average number of times a piece of content appears each month.

In [98]:
test_df.dropna(inplace=True)
test_df.isnull().sum()

Content ID       0
Reaction Type    0
Datetime         0
Content Type     0
Category         0
Sentiment        0
Score            0
dtype: int64

In [99]:
# Confirm "test_df" is the same as "merged_df" from previous analysis
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 24573 entries, 0 to 24572
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Content ID     24573 non-null  object        
 1   Reaction Type  24573 non-null  object        
 2   Datetime       24573 non-null  datetime64[ns]
 3   Content Type   24573 non-null  object        
 4   Category       24573 non-null  object        
 5   Sentiment      24573 non-null  object        
 6   Score          24573 non-null  float64       
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 1.5+ MB


In [100]:
# Confirm the number of unique "Content ID" values match previous analysis
print(f"Number of Unique Content ID's | {test_df['Content ID'].nunique()}")

Number of Unique Content ID's | 962


In [101]:
# Random "Content ID" to check
content_id_to_check = 'fada6910-2cc5-4600-808c-1e6066f795a6'

# Filter the DataFrame for the specified Content ID
filtered_data = test_df[test_df['Content ID'] == content_id_to_check].copy()

# Extract month and year from the "Datetime" column using .loc
filtered_data.loc[:, 'Month'] = filtered_data['Datetime'].dt.month_name()
filtered_data.loc[:, 'Year'] = filtered_data['Datetime'].dt.year

# Count occurrences by month and year
monthly_counts = filtered_data.groupby(['Year', 'Month']).size().reset_index(name='Occurrences')

# Pivot the data to get months as columns
result_df = monthly_counts.pivot(index='Year', columns='Month', values='Occurrences').fillna(0)

# Add a column for total occurrences per year
result_df['Total Occurrences'] = result_df.sum(axis=1)

# Display the result as a DataFrame
result_df = result_df.astype(int)  # Convert to integer if needed
result_df

Year,April,August,December,February,January,July,June,March,May,November,October,Total Occurrences
2020,0,2,1,0,0,3,0,0,0,1,1,8
2021,3,0,0,3,2,0,1,1,2,0,0,12
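
The pivot above lists months alphabetically and leaves out any month with no reactions. If a calendar-ordered table is easier to read, one option is to reindex the columns against a fixed month list. This is a minimal sketch on hypothetical counts; `month_order` is a helper name introduced here for illustration.

```python
import pandas as pd

# Hypothetical monthly counts for one Content ID
counts = pd.DataFrame({
    "Year": [2020, 2020, 2021, 2021],
    "Month": ["July", "August", "January", "April"],
    "Occurrences": [3, 2, 2, 3],
})

# English month names in calendar order, January to December
month_order = list(pd.date_range("2021-01-01", periods=12, freq="MS").month_name())

table = (
    counts.pivot(index="Year", columns="Month", values="Occurrences")
    .reindex(columns=month_order, fill_value=0)  # add missing months, fix the order
    .fillna(0)                                   # months missing for only one year
    .astype(int)
)
print(table)
```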


In [102]:
# Create Month-Year combinations for each Content ID
test_df['Month'] = test_df['Datetime'].dt.month_name()
test_df['Year'] = test_df['Datetime'].dt.year

# Count occurrences
monthly_counts = test_df.groupby(['Year', 'Month', 'Content ID']).size().reset_index(name='Occurrences')

# Create a complete DataFrame for merging using a Cartesian product
all_months = test_df['Month'].unique()
all_content_ids = test_df['Content ID'].unique()
complete_df = pd.MultiIndex.from_product([all_content_ids, all_months], names=['Content ID', 'Month']).to_frame(index=False)

# Merge and fill missing values
merged_counts = complete_df.merge(monthly_counts, on=['Content ID', 'Month'], how='left').fillna(0)

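# Calculate the average occurrences per month, including months with zero occurrences.
# Note: the merge keys omit "Year", so a month name present in more than one year yields
# one row per year for that Content ID and the mean's denominator can exceed 12.
average_counts = merged_counts.groupby('Content ID')['Occurrences'].mean().reset_index(name='Avg Monthly Occurrences')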

# Calculate the number of months each Content ID appeared in
months_appeared = merged_counts[merged_counts['Occurrences'] > 0].groupby('Content ID')['Month'].nunique().reset_index(name='Months Appeared')

# Calculate total occurrences
total_occurrences = test_df['Content ID'].value_counts().reset_index()
total_occurrences.columns = ['Content ID', 'Total Occurrences']

# Merge average, total occurrences, and months appeared
result_df = average_counts.merge(total_occurrences, on='Content ID').merge(months_appeared, on='Content ID')

# Display the result
result_df.head(10)

Unnamed: 0,Content ID,Avg Monthly Occurrences,Total Occurrences,Months Appeared
0,004e820e-49c3-4ba2-9d02-62db0065410c,0.083333,1,1
1,00d0cdf9-5919-4102-bf84-ebde253c3cd2,3.538462,46,12
2,01396602-c759-4a17-90f0-8f9b3ca11b30,3.076923,40,12
3,019b61f4-926c-438e-adaf-6119c5eab752,1.083333,13,11
4,01ab84dd-6364-4236-abbb-3f237db77180,0.083333,1,1
5,01aff5ec-2aa8-412e-99ec-526f0f9a6d5e,3.307692,43,11
6,02664d35-87cf-46a6-a80b-78fbc9ac8b2f,2.916667,35,10
7,026973ef-4b73-4901-9160-bc9e04516057,3.0,36,10
8,0289b7bc-bc95-4f1d-83fe-9ca662194158,2.5,30,11
9,02ba5af1-784a-44cc-ae3a-14833c4a2237,3.692308,48,12
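
In the table above, the averages do not all appear to share one denominator: 46 / 13 ≈ 3.538462 for the second row, but 35 / 12 ≈ 2.916667 for the seventh. A likely cause is that months are matched by name only, so a month that occurs in both years is counted twice for some IDs. A minimal sketch of one alternative, assuming hypothetical data, is to key on a year-month `Period` so that every ID is averaged over the same set of calendar months:

```python
import pandas as pd

# Hypothetical reaction log: "a" is reacted to in June of both years
toy = pd.DataFrame({
    "Content ID": ["a", "a", "a", "b"],
    "Datetime": pd.to_datetime([
        "2020-06-20", "2020-09-01", "2021-06-02", "2021-01-15",
    ]),
})

# A Period keeps year and month together, so June 2020 differs from June 2021
toy["Year-Month"] = toy["Datetime"].dt.to_period("M")

# Every calendar month between the first and last reaction
all_periods = pd.period_range(toy["Year-Month"].min(), toy["Year-Month"].max(), freq="M")

# Rows are Content IDs, columns are calendar months, missing combinations are 0
counts = (
    toy.groupby(["Content ID", "Year-Month"]).size()
    .unstack(fill_value=0)
    .reindex(columns=all_periods, fill_value=0)
)

summary = pd.DataFrame({
    "Total Occurrences": counts.sum(axis=1),
    "Months Appeared": (counts > 0).sum(axis=1),
    "Avg Monthly Occurrences": counts.mean(axis=1),
})
print(f"Calendar months covered | {len(all_periods)}")
print(summary)
```

With this layout every average is the total divided by the same number of calendar months (13 in the toy example), so the values are directly comparable between IDs.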


In [103]:
# Get the row with the maximum total occurrences
max_occurrence_row = result_df.loc[result_df['Total Occurrences'].idxmax()]

# Get the row with the minimum total occurrences
min_occurrence_row = result_df.loc[result_df['Total Occurrences'].idxmin()]

# Create a DataFrame to display the results
max_min_df = pd.DataFrame([max_occurrence_row, min_occurrence_row])

# Display the result
max_min_df

Unnamed: 0,Content ID,Avg Monthly Occurrences,Total Occurrences,Months Appeared
9,02ba5af1-784a-44cc-ae3a-14833c4a2237,3.692308,48,12
0,004e820e-49c3-4ba2-9d02-62db0065410c,0.083333,1,1


In [104]:
# Calculate the overall average monthly occurrences
monthly_average = result_df['Avg Monthly Occurrences'].mean()
overall_average = result_df['Total Occurrences'].mean()

# Display the result
print(f'Overall Average Monthly Occurrences | {monthly_average:.2f}')
print(f'Overall Average Total Occurrences | {overall_average:.2f}')

Overall Average Monthly Occurrences | 2.04
Overall Average Total Occurrences | 25.54


<br>

# **Results**

The analysis results indicate that the 980 rows with blanks in the 'Reaction Type' column `do not represent the original upload dates for each piece of content`, as initially hypothesized. In Phase 1, I attempted to confirm this by checking whether the first occurrence of each unique 'Content ID' contained blanks in the 'Reaction Type', 'Sentiment', and 'Score' columns. However, this hypothesis was disproven, as `962 of the 980 first occurrences of unique 'Content IDs' had values in the 'Reaction Type' column.`

Moving to Phase 2, I tested the alternative hypothesis that a 'Content ID' row does not represent an upload but simply an instance in which a piece of content received a reaction. My analysis showed that a typical 'Content ID' appeared multiple times across several months. A sample 'Content ID' appeared 20 times over 11 months. Additionally, `total occurrences per Content ID ranged` from a minimum of `1 to a maximum of 48`, `averaging 25.54` appearances across the dataset, or `2.04 times per month`.

These findings support the conclusion that unique "Content IDs" likely indicate posts receiving reactions rather than unique upload events. This repetition across months aligns with ongoing engagement rather than one-off posting. Therefore, I cannot confirm how many posts were made within a given month; I can only confirm the number of posts that received reactions each month.