# Business Understanding

## Overview



The Paris Olympics is a global sporting event that has attracted significant attention and engagement across social media platforms. Analyzing public sentiment about the Olympics can reveal how athletes, countries and the overall event are perceived. This analysis can benefit sports organizations, media outlets and sponsors by offering feedback on public perception, performance and engagement levels, helping them tailor content and marketing strategies. It can also help city officials improve planning and address concerns such as health and sanitation.
The goal of this project is to perform sentiment analysis of social media content related to the 2024 Paris Olympics in order to understand public sentiment, identify emerging trends and show how different aspects of the Olympics resonate with audiences worldwide.

## Problem Statement



The Paris Olympics is a high-profile event that generates a substantial volume of unstructured social media data reflecting public sentiment. The challenge lies in analyzing this vast and diverse stream of data effectively while handling language differences, sentiment variations and contextual meanings, so that the resulting insights are accurate and actionable.

## Proposed Solutions



1.	Use API access to collect data from major social media platforms and ensure compliance with platform policies and data protection regulations.
2.	Implement text normalization, tokenization and content filtering while utilizing language detection and translation tools for multilingual data handling. 
3.	Employ advanced natural language processing models like BERT or GPT for sentiment classification incorporating sarcasm detection and contextual analysis for improved accuracy. 
4.	Create an interactive dashboard using Tableau to display sentiment trends and insights with features for data filtering and exploring different aspects of the data.
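The normalization and tokenization step in item 2 can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual pipeline: the regular expressions and the function names `normalize` and `tokenize` are assumptions made for this example, and a real implementation would likely rely on a dedicated NLP tokenizer.

```python
import re

# Hypothetical patterns chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"#?\w+")


def normalize(text: str) -> str:
    """Drop URLs and @mentions, collapse whitespace, and lowercase the text."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text: str) -> list[str]:
    """Split normalized text into word tokens, keeping hashtags intact."""
    return TOKEN_RE.findall(normalize(text))


sample = "@PopBase #Olympics USA is good but will be beaten"
print(tokenize(sample))
# ['#olympics', 'usa', 'is', 'good', 'but', 'will', 'be', 'beaten']
```

Hashtags are kept as single tokens here because they often carry topic information (for example `#paris2024`); whether to keep, split or drop them is a design choice that depends on the downstream model.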

## Objectives




### Main Objective
Develop a comprehensive social media sentiment analysis model that accurately captures and interprets public sentiment about the Paris Olympics from social media data.

### Specific Objectives
1.	To extract, preprocess and clean social media data related to the Paris Olympics from multiple platforms, addressing quality issues and handling multilingual content.
2.	To develop and train advanced natural language processing models that accurately classify sentiment, incorporating techniques to handle sarcasm and contextual nuances.
3.	To create interactive visualizations that display sentiment trends and key events, providing stakeholders with actionable insights based on a comprehensive analysis of public opinion.

## Success Metrics



- Accuracy - The proportion of correctly classified sentiments (positive, negative, neutral) out of all sentiments predicted by the model.
  Target: 85% - 90%

- Precision - For a given sentiment class, the proportion of correct predictions (e.g. correctly identified positive tweets) out of all predictions of that class.
  Target: 80% - 90% for both the positive and negative classes; 75% - 85% for the neutral class.

- Recall - For a given sentiment class, the proportion of correct predictions out of all actual instances of that class.
  Target: 75% - 80% for all sentiment classes.

- F1 Score - The harmonic mean of precision and recall, giving a single metric that balances the two.
  Target: 0.75 - 0.85

- Area Under the Curve - Receiver Operating Characteristic (AUC-ROC) - Measures how well the model distinguishes between classes.
  Target: > 0.85
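To make the definitions above concrete, the following sketch computes accuracy and per-class precision, recall and F1 from two lists of labels. It uses only the standard library; the function name `sentiment_report` and the sample labels are hypothetical and are not project results. AUC-ROC is not covered because it needs predicted probabilities rather than hard labels. In practice a library such as scikit-learn would normally be used for all of these metrics.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def sentiment_report(y_true, y_pred, labels=LABELS):
    """Return overall accuracy plus per-class precision, recall and F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed an instance of the true class
    report = {"accuracy": sum(tp.values()) / len(y_true) if y_true else 0.0}
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up labels purely for illustration.
y_true = ["positive", "negative", "neutral", "positive", "negative", "neutral"]
y_pred = ["positive", "negative", "positive", "positive", "neutral", "neutral"]
report = sentiment_report(y_true, y_pred)
print(report["accuracy"])
print(report["positive"])
```

In this toy example four of six predictions are correct, and the positive class has perfect recall but lower precision because one neutral post was labelled positive.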

## Challenges


- Social media data is noisy and unstructured, presenting challenges for accurate analysis.
- Distinguishing between positive, negative and neutral sentiments can be difficult, especially when dealing with multilingual content, which affects sentiment analysis accuracy.
- The volume of social media posts and comments can be overwhelming, particularly during major events like the Olympics. Managing and processing large volumes of real-time data necessitates efficient data handling and processing techniques.
- Interpreting context and sarcasm adds an extra layer of complexity, as the sentiment expressed may not always align with the literal meaning of the words used. Social media content often includes informal language, slang and nuanced expressions that can skew sentiment analysis.

## Conclusion



This sentiment analysis project aims to deliver a comprehensive understanding of public opinion about the Paris Olympics by leveraging social media data. By addressing the challenges of data quality, sentiment accuracy and multilingual content, and by implementing advanced NLP techniques, the project will provide actionable insights to the stakeholders identified above. Successful execution will enable better engagement strategies and enhance the overall experience of the Olympics for audiences worldwide.

## Data Understanding




### Data Sources
1.	APIs - Extract data from social media sites such as Twitter, Facebook and Instagram in the form of posts, tweets, comments and hashtags using their respective APIs.
The focus will be on posts mentioning the Paris Olympics, relevant hashtags and location-based data.

2.	Web Scraping - Extract additional data from comments and discussions on news sites and sports forums such as ESPN and SportsCenter.

### Datasets



1.	Social media data in the form of tweets and Facebook and Instagram posts and comments mentioning the Paris Olympics.
2.	News articles, comments and replies discussing the various aspects of the Olympics.

### Relevance of The Data



> The data sources and datasets identified for this project are highly relevant to analyzing public sentiment surrounding the Paris Olympics. Social media platforms like Twitter, Facebook and Instagram capture immediate reactions, discussions and emotional responses from a global audience, providing a rich source of unfiltered public sentiment.
The inclusion of location-based data and relevant hashtags allows for more targeted analysis, potentially revealing geographical trends and topic-specific sentiments. Complementing this with web scraping of news sites and sports forums like ESPN and SportsCenter adds depth to the analysis by incorporating more structured discussions and content.
This combination of data sources offers a comprehensive view of public sentiment ranging from spontaneous reactions on social media to more considered opinions in news comment sections and sports forums.

#### Python Modules Importation

In [None]:
# import necessary modules
# data manipulation
import pandas as pd
import os
import numpy as np

#### Combining CSV Files

In [15]:
# Folder in which the csv files are found
folder_path = 'X_data'

# List of all the csv files found in the folder (sorted so the merge order is reproducible)
csv_files = sorted(f for f in os.listdir(folder_path) if f.endswith('.csv'))

# Create an empty list to store DataFrames
data_frames = []

# Loop through the list of CSV files and read each one into a DataFrame
for file in csv_files:
    file_path = os.path.join(folder_path, file)
    df = pd.read_csv(file_path)
    data_frames.append(df)

# Concatenate all DataFrames into one (stack vertically)
merged_df = pd.concat(data_frames, ignore_index=True, sort=False)

merged_df.info()

# Save the concatenated DataFrame to a new CSV file
output_file = 'merged_file.csv'
merged_df.to_csv(output_file, index=False)
# print(f'merged CSV file saved as {output_file}')



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2600 entries, 0 to 2599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   id             2600 non-null   int64 
 1   tweetText      2600 non-null   object
 2   tweetURL       2600 non-null   object
 3   type           2600 non-null   object
 4   tweetAuthor    2600 non-null   object
 5   handle         2600 non-null   object
 6   geo            1573 non-null   object
 7   mentions       1177 non-null   object
 8   hashtags       1024 non-null   object
 9   replyCount     2600 non-null   int64 
 10  quoteCount     2600 non-null   int64 
 11  retweetCount   2600 non-null   int64 
 12  likeCount      2600 non-null   int64 
 13  views          2600 non-null   object
 14  bookmarkCount  2600 non-null   int64 
 15  createdAt      2600 non-null   object
 16  allMediaURL    578 non-null    object
 17  videoURL       136 non-null    object
dtypes: int64(6), object(12)
memo

In [16]:
# DataUnderstanding is a project helper class; its definition is not shown in this section
data = DataUnderstanding()
df = data.load_data(path="merged_file.csv")
# First five rows of dataset
df.head()

Unnamed: 0,id,tweetText,tweetURL,type,tweetAuthor,handle,geo,mentions,hashtags,replyCount,quoteCount,retweetCount,likeCount,views,bookmarkCount,createdAt,allMediaURL,videoURL
0,1820934767312859634,Kellie Harrington homecoming after Tokyo 2020 ...,https://x.com/HonestFrank/status/1820934767312...,tweet,Francis Keogh,@HonestFrank,,,"#Olympics,#boxing",0,0,0,0,-,0,2024-08-07 00:27:34,https://pbs.twimg.com/media/E8dKeTiWQAAdJVN.jpg,https://video.twimg.com/amplify_video/14251889...
1,1820934764871794856,ÎÎµ ÏÏÎ¿Ï Î¤ÎÎÎÎÎ©Î£ÎÎ¤Î Î¼ÏÎ®ÎºÎµ ...,https://x.com/WaltersPa4652/status/18209347648...,tweet,Paulette Walters,@WaltersPa4652,,,"#TeamHellas,#paris2024,#Olympics,#paris2024gr",0,0,0,0,-,0,2024-08-07 00:27:33,,
2,1820934747964559415,@PopBase #Olympics USA is good but will be beaten,https://x.com/JayGaran/status/1820934747964559415,tweet,Jay Garan 🇰🇪🇰🇪,@JayGaran,"Mombasa, Kenya",@PopBase,#Olympics,0,0,0,0,-,0,2024-08-07 00:27:29,,
3,1820934747280859532,ÎÎµ ÏÏÎ¿Ï Î¤ÎÎÎÎÎ©Î£ÎÎ¤Î Î¼ÏÎ®ÎºÎµ ...,https://x.com/WaltersPa4652/status/18209347472...,tweet,Paulette Walters,@WaltersPa4652,,,"#TeamHellas,#paris2024,#Olympics,#paris2024gr",0,0,0,0,-,0,2024-08-07 00:27:29,,
4,1820934736237199545,#Olympics Women's semi-final:\nBrazil are flyi...,https://x.com/BabaGol_/status/1820934736237199545,tweet,BabaGol,@BabaGol_,,,#Olympics,0,0,0,0,-,0,2024-08-07 00:27:27,https://pbs.twimg.com/media/GUVCeZMXEAAjpsA.jp...,
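The output above shows two columns that need cleaning before analysis: `views` is stored as text and contains `-` where no count is available, and `hashtags` holds comma-separated strings with many missing values. The sketch below shows one possible way to handle both. The helper names `parse_views` and `split_hashtags` are hypothetical, and it is an assumption that any non-numeric `views` value should be treated as missing.

```python
import math


def is_missing(value) -> bool:
    """True for None or a float NaN, which is how pandas marks empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def parse_views(value):
    """Convert a raw `views` cell to int; return None for '-' or other non-numeric text."""
    if is_missing(value):
        return None
    text = str(value).strip()
    return int(text) if text.isdecimal() else None


def split_hashtags(value) -> list[str]:
    """Turn a comma-separated hashtag string into a list of lowercase tags."""
    if is_missing(value):
        return []
    return [tag.strip().lower() for tag in str(value).split(",") if tag.strip()]


print(parse_views("-"))
print(split_hashtags("#TeamHellas,#paris2024,#Olympics,#paris2024gr"))
```

With the DataFrame loaded above, these helpers could be applied column-wise, for example `df["views"].map(parse_views)` and `df["hashtags"].map(split_hashtags)`.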

