# Data Analysis Project: Global Conflict Hashtags on Social Media
- **LinkedIn:** Muhammad Aditya Bayhaqie
- **Email:** adityabayhaqie@gmail.com
- **GitHub:** bayhaqieee

## Business Questions

- At what times is posting activity highest and lowest for the most-used hashtag?
- Which conflict has the highest exposure by hashtag?
- What is the content summary for each hashtag?
- Which hashtag has the most comments, likes, and views (each measured separately)?
- Which social media platform generates the highest engagement for these hashtags?

## Data Preparation

#### Importing Libraries

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from datetime import datetime
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

#### Assigning Data

In [3]:
data_df = pd.read_csv("Data/conflicts_hashtag_search.csv")
data_df.head()

Unnamed: 0,fromSocial,text,likesCount,commentsCount,viewsCount,input,authorMeta/name,creationDate
0,youtube,,,,,yemencivilwar,,
1,youtube,Russia-Ukraine Conflict: Putin Warns NATO Risk...,22.0,7.0,3069.0,russiaukraineconflict,CNN-News18,2024-09-13T16:23:39.000Z
2,youtube,LIVE: Russia Launches Waves of Drone Attacks o...,118.0,21.0,16358.0,russiaukraineconflict,Firstpost,2024-10-01T00:17:49.000Z
3,youtube,Russian Forces Take Over Ukraine's Avdiivka | ...,166.0,56.0,18054.0,russiaukraineconflict,CNBC-TV18,2024-02-19T15:37:34.000Z
4,youtube,Russia-Ukraine War: Ukraine's Surprise Attack ...,53.0,4.0,6096.0,russiaukraineconflict,DD India,2024-08-24T13:36:51.000Z


**Insight:**
- There is one dataset in total, containing the following columns:
    - fromSocial
        - The social media platform the post was published on
    - text
        - The headline or text of the post
    - likesCount
        - Number of likes
    - commentsCount
        - Number of comments
    - viewsCount
        - Number of views
    - input
        - The hashtag (**PRIMARY**)
    - authorMeta/name
        - The author of the post
    - creationDate
        - The date the post was created

## Data Assessment

### Assessing Data

In [4]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2546 entries, 0 to 2545
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fromSocial       2546 non-null   object 
 1   text             2543 non-null   object 
 2   likesCount       2545 non-null   float64
 3   commentsCount    2545 non-null   float64
 4   viewsCount       321 non-null    float64
 5   input            2546 non-null   object 
 6   authorMeta/name  2483 non-null   object 
 7   creationDate     2545 non-null   object 
dtypes: float64(3), object(5)
memory usage: 159.3+ KB


In [5]:
data_df.isna().sum()

fromSocial            0
text                  3
likesCount            1
commentsCount         1
viewsCount         2225
input                 0
authorMeta/name      63
creationDate          1
dtype: int64

In [6]:
data_df[['likesCount','commentsCount','viewsCount']].describe()

Unnamed: 0,likesCount,commentsCount,viewsCount
count,2545.0,2545.0,321.0
mean,712.897839,148.247544,252216.9
std,7323.585293,2475.833944,1023982.0
min,-1.0,0.0,73.0
25%,1.0,0.0,5106.0
50%,8.0,0.0,21653.0
75%,98.0,3.0,78378.0
max,219000.0,91917.0,11227140.0


In [9]:
minus_likes = data_df[data_df['likesCount'] == -1]

minus_likes.info()
minus_likes.head()

<class 'pandas.core.frame.DataFrame'>
Index: 96 entries, 53 to 2449
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fromSocial       96 non-null     object 
 1   text             96 non-null     object 
 2   likesCount       96 non-null     float64
 3   commentsCount    96 non-null     float64
 4   viewsCount       0 non-null      float64
 5   input            96 non-null     object 
 6   authorMeta/name  96 non-null     object 
 7   creationDate     96 non-null     object 
dtypes: float64(3), object(5)
memory usage: 6.8+ KB


Unnamed: 0,fromSocial,text,likesCount,commentsCount,viewsCount,input,authorMeta/name,creationDate
53,instagram,"On October 7, 2023, Israel faced an unexpected...",-1.0,1.0,,israelpalestineconflict,The Indian Netizens,2024-10-07T08:10:50.000Z
57,instagram,🕯️ 🕊️\n.\n.\n.\n.\n#peacebuilding #conflictres...,-1.0,13.0,,israelpalestineconflict,Amir Sommer,2024-10-07T05:39:41.000Z
135,instagram,"🕌 De acordo com a imprensa internacional, os H...",-1.0,0.0,,yemencivilwar,"Conversas com a História | Guerras, História e...",2024-01-05T17:31:40.000Z
136,instagram,#freepalestine🇵🇸 #freecongo🇨🇩 #freesudan🇸🇩 #fr...,-1.0,0.0,,yemencivilwar,Laila Imani,2023-11-15T00:18:27.000Z
138,instagram,#freepalestine🇵🇸 #freecongo🇨🇩 #freesudan🇸🇩 #fr...,-1.0,0.0,,yemencivilwar,Laila Imani,2023-11-15T00:15:44.000Z


In [10]:
max_likes = data_df[data_df['likesCount'] == 219000]

max_likes.info()
max_likes.head()

<class 'pandas.core.frame.DataFrame'>
Index: 1 entries, 269 to 269
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fromSocial       1 non-null      object 
 1   text             1 non-null      object 
 2   likesCount       1 non-null      float64
 3   commentsCount    1 non-null      float64
 4   viewsCount       1 non-null      float64
 5   input            1 non-null      object 
 6   authorMeta/name  1 non-null      object 
 7   creationDate     1 non-null      object 
dtypes: float64(3), object(5)
memory usage: 72.0+ bytes


Unnamed: 0,fromSocial,text,likesCount,commentsCount,viewsCount,input,authorMeta/name,creationDate
269,youtube,PART 2: Andrew Tate Talks Palestine and Israel...,219000.0,91917.0,9051874.0,israelpalestineconflict,Piers Morgan Uncensored,2023-11-21T21:00:04.000Z


**Insight:**
- viewsCount has a large number of null values (2,225 of 2,546 rows)
- likesCount, commentsCount, text, authorMeta/name, and creationDate contain a small number of null values
- 96 rows have a likesCount of -1, which is not a valid like count
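
To put the findings above in proportion, the missing share per column and the platforms carrying the -1 value can be checked directly. The sketch below uses a small made-up frame (`toy_df` is a hypothetical stand-in for `data_df`, with the same column names); on the real data, viewsCount is missing in 2,225 of 2,546 rows, roughly 87.4%.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names follow data_df, values are made up.
toy_df = pd.DataFrame({
    "fromSocial": ["youtube", "instagram", "instagram", "youtube"],
    "likesCount": [22.0, -1.0, -1.0, np.nan],
    "commentsCount": [7.0, 1.0, 0.0, np.nan],
    "viewsCount": [3069.0, np.nan, np.nan, np.nan],
})

# Share of missing values per column, as a percentage.
missing_pct = toy_df.isna().mean().mul(100).round(1)

# Number of rows per platform that carry the -1 likes value.
sentinel_by_platform = (
    toy_df.loc[toy_df["likesCount"] == -1, "fromSocial"].value_counts()
)

print(missing_pct)
print(sentinel_by_platform)
```

Running the same two expressions on `data_df` would show whether the -1 values are concentrated on a single platform, which may hint that -1 marks hidden like counts rather than a data error. That interpretation is an assumption, not something the dataset documents.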

## Data Cleaning

### Database Dataset

- **Issues:**
    - 96 rows have a likesCount of -1, which is not a valid like count
    - Null values in likesCount, commentsCount, text, authorMeta/name, and creationDate
    - A large number of null values in viewsCount
- **Actions:**
    - Replace -1 likes with 0 likes
    - Drop rows with null values in likesCount, commentsCount, text, authorMeta/name, or creationDate
    - Fill missing viewsCount values with predictions based on likesCount and commentsCount
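
The first two actions can be sketched as a single helper. This is a minimal illustration on made-up rows, not the notebook's own code: `clean_posts` and `toy_df` are hypothetical names, and the helper leaves viewsCount untouched because it is handled separately by the prediction step.

```python
import numpy as np
import pandas as pd

def clean_posts(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: apply the first two cleaning actions to a copy."""
    cleaned = df.copy()
    # Map the invalid -1 like count to 0.
    cleaned["likesCount"] = cleaned["likesCount"].replace(-1, 0)
    # Drop rows missing any of the sparsely-null columns; viewsCount is
    # deliberately excluded because it is filled in later.
    required = ["likesCount", "commentsCount", "text",
                "authorMeta/name", "creationDate"]
    return cleaned.dropna(subset=required).reset_index(drop=True)

# Toy rows with the same column names as data_df (values are made up).
toy_df = pd.DataFrame({
    "fromSocial": ["youtube", "instagram", "youtube"],
    "text": [None, "post a", "post b"],
    "likesCount": [np.nan, -1.0, 22.0],
    "commentsCount": [np.nan, 1.0, 7.0],
    "viewsCount": [np.nan, np.nan, 3069.0],
    "input": ["tag_a", "tag_b", "tag_b"],
    "authorMeta/name": [None, "author a", "author b"],
    "creationDate": [None, "2024-10-07T08:10:50.000Z", "2024-09-13T16:23:39.000Z"],
})

cleaned_df = clean_posts(toy_df)
print(cleaned_df[["likesCount", "commentsCount", "viewsCount"]])
```

Dropping the incomplete rows before modelling also means the features passed to the regression no longer contain NaN.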

In [11]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2546 entries, 0 to 2545
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fromSocial       2546 non-null   object 
 1   text             2543 non-null   object 
 2   likesCount       2545 non-null   float64
 3   commentsCount    2545 non-null   float64
 4   viewsCount       321 non-null    float64
 5   input            2546 non-null   object 
 6   authorMeta/name  2483 non-null   object 
 7   creationDate     2545 non-null   object 
dtypes: float64(3), object(5)
memory usage: 159.3+ KB


In [12]:
minus_likes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 96 entries, 53 to 2449
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fromSocial       96 non-null     object 
 1   text             96 non-null     object 
 2   likesCount       96 non-null     float64
 3   commentsCount    96 non-null     float64
 4   viewsCount       0 non-null      float64
 5   input            96 non-null     object 
 6   authorMeta/name  96 non-null     object 
 7   creationDate     96 non-null     object 
dtypes: float64(3), object(5)
memory usage: 6.8+ KB


In [13]:
data_df['likesCount'] = data_df['likesCount'].replace(-1, 0)

In [15]:
minus_likes = data_df[data_df['likesCount'] == -1]
minus_likes.info()

<class 'pandas.core.frame.DataFrame'>
Index: 0 entries
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   fromSocial       0 non-null      object 
 1   text             0 non-null      object 
 2   likesCount       0 non-null      float64
 3   commentsCount    0 non-null      float64
 4   viewsCount       0 non-null      float64
 5   input            0 non-null      object 
 6   authorMeta/name  0 non-null      object 
 7   creationDate     0 non-null      object 
dtypes: float64(3), object(5)
memory usage: 0.0+ bytes


In [23]:
data_df.isna().sum()

fromSocial            0
text                  3
likesCount            1
commentsCount         1
viewsCount         2225
input                 0
authorMeta/name      63
creationDate          1
dtype: int64

#### viewsCount Prediction using Linear Regression

In [17]:
# a. Data with non-null viewsCount
train_df = data_df.dropna(subset=['viewsCount'])

# b. Data with null viewsCount (this is the data we want to predict)
predict_df = data_df[data_df['viewsCount'].isnull()]

In [19]:
X_train = train_df[['likesCount', 'commentsCount']]
y_train = train_df['viewsCount']

In [20]:
model = LinearRegression()
model.fit(X_train, y_train)

In [21]:
X_predict = predict_df[['likesCount', 'commentsCount']]

In [22]:
predicted_views = model.predict(X_predict)

ValueError: Input X contains NaN.
LinearRegression does not accept missing values encoded as NaN natively. For supervised learning, you might want to consider sklearn.ensemble.HistGradientBoostingClassifier and Regressor which accept missing values encoded as NaNs natively. Alternatively, it is possible to preprocess the data, for instance by using an imputer transformer in a pipeline or drop samples with missing values. See https://scikit-learn.org/stable/modules/impute.html You can find a list of all estimators that handle NaN values at the following page: https://scikit-learn.org/stable/modules/impute.html#estimators-that-handle-nan-values
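
The error occurs because `predict_df` still contains at least one row where likesCount and commentsCount are themselves NaN (the `isna()` summary reports one null in each). One possible fix, sketched here on a toy frame rather than the real `data_df`, is to predict only for rows that lack viewsCount but have both features present. The first four rows reuse values from the `head()` output above; the last two rows, the `needs_pred` mask, and the clipping at zero are assumptions added for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

features = ["likesCount", "commentsCount"]

# Toy stand-in for data_df: four complete rows, one row missing only
# viewsCount, and one row missing everything.
toy_df = pd.DataFrame({
    "likesCount":    [22.0, 118.0, 166.0, 53.0, 10.0, np.nan],
    "commentsCount": [7.0, 21.0, 56.0, 4.0, 2.0, np.nan],
    "viewsCount":    [3069.0, 16358.0, 18054.0, 6096.0, np.nan, np.nan],
})

# Train only on rows where the target and both features are present.
train_df = toy_df.dropna(subset=features + ["viewsCount"])
model = LinearRegression().fit(train_df[features], train_df["viewsCount"])

# Predict only for rows that lack viewsCount but have complete features.
needs_pred = toy_df["viewsCount"].isna() & toy_df[features].notna().all(axis=1)
predicted = model.predict(toy_df.loc[needs_pred, features])

# Assumption: negative view counts are meaningless, so clip at zero.
toy_df.loc[needs_pred, "viewsCount"] = np.clip(predicted, 0, None)

print(toy_df)
```

The row with missing features keeps a NaN viewsCount; it can be dropped afterwards, in line with the planned cleaning actions. With only 321 labelled rows and heavily skewed counts, the predicted views should be treated as rough estimates.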