# Social Media Sentiment Analysis Project
-> Problem Statement (Final Step After Conclusion - Rehab Responsibility)

---------------

## Step 1: Reading and Understanding the data.
Assigned to: Akayiz/Youstina

In [2]:
import pandas as pd

In [3]:
df = pd.read_csv('sentimentdataset.csv')

In [4]:
df.shape

(732, 14)

In [5]:
df.head()

| | ID | Text | Sentiment (Label) | Timestamp | User | Source | Topic | Retweets | Likes | Country | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Enjoying a beautiful day at the park! ... | Positive | 1/15/2023 12:30 | User123 | Twitter | #Nature #Park | 15 | 30 | USA | 2023 | 1 | 15 | 12 |
| 1 | 1 | Traffic was terrible this morning. ... | Negative | 1/15/2023 8:45 | CommuterX | Twitter | #Traffic #Morning | 5 | 10 | Canada | 2023 | 1 | 15 | 8 |
| 2 | 2 | Just finished an amazing workout! 💪 ... | Positive | 1/15/2023 15:45 | FitnessFan | Instagram | #Fitness #Workout | 20 | 40 | USA | 2023 | 1 | 15 | 15 |
| 3 | 3 | Excited about the upcoming weekend getaway! ... | Positive | 1/15/2023 18:20 | AdventureX | Facebook | #Travel #Adventure | 8 | 15 | UK | 2023 | 1 | 15 | 18 |
| 4 | 4 | Trying out a new recipe for dinner tonight. ... | Neutral | 1/15/2023 19:55 | ChefCook | Instagram | #Cooking #Food | 12 | 25 | Australia | 2023 | 1 | 15 | 19 |


In [6]:
list(df.columns)

['ID',
 'Text',
 'Sentiment (Label)',
 'Timestamp',
 'User',
 'Source',
 'Topic',
 'Retweets',
 'Likes',
 'Country',
 'Year',
 'Month',
 'Day',
 'Hour']

## Column Descriptions:
- <b>ID:</b> An integer that represents a unique identifier for each row (0-based)
- <b>Text:</b> A string that contains the posted message
- <b>Sentiment (Label):</b> Message author's feelings when writing the message
- <b>Timestamp:</b> Date and time when the message was posted (mm/dd/yyyy hh:mm, 24-hour format)
- <b>User:</b> Author's account name
- <b>Source:</b> Source website where the author posted the message
- <b>Topic:</b> Topics related to the content of the message (two hashtags per message)
- <b>Retweets:</b> Number of retweets/reposts on social media platform
- <b>Likes:</b> Number of likes on social media platform
- <b>Country:</b> Country where the author is located
- <b>Year:</b> Year when the message was posted
- <b>Month:</b> Month when the message was posted
- <b>Day:</b> Day when the message was posted
- <b>Hour:</b> The hour (in 24-hour format) in which this message was posted

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 732 entries, 0 to 731
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   ID                 732 non-null    int64 
 1   Text               732 non-null    object
 2   Sentiment (Label)  732 non-null    object
 3   Timestamp          732 non-null    object
 4   User               732 non-null    object
 5   Source             732 non-null    object
 6   Topic              732 non-null    object
 7   Retweets           732 non-null    int64 
 8   Likes              732 non-null    int64 
 9   Country            732 non-null    object
 10  Year               732 non-null    int64 
 11  Month              732 non-null    int64 
 12  Day                732 non-null    int64 
 13  Hour               732 non-null    int64 
dtypes: int64(7), object(7)
memory usage: 80.2+ KB


#### Dropping unnecessary columns

- **ID:** Redundant, since pandas already generates an equivalent 0-based row index
- **Timestamp:** Its information is already available in the Year, Month, Day and Hour columns. Only the minutes are lost, which we believe are not useful since no trend lasts for less than an hour
- **User:** The author's account name is not related to the sentiment

In [10]:
df.drop(['ID', 'Timestamp', 'User'], axis=1, inplace=True)
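
Before dropping Timestamp, the redundancy claim can be checked directly. The following is a minimal sketch on a three-row stand-in built from the `df.head()` values above; the names `sample`, `parsed` and `timestamp_is_redundant` are illustrative and not part of the notebook.

```python
import pandas as pd

# Tiny stand-in for the real data (values copied from df.head() above).
sample = pd.DataFrame({
    "Timestamp": ["1/15/2023 12:30", "1/15/2023 8:45", "1/15/2023 15:45"],
    "Year": [2023, 2023, 2023],
    "Month": [1, 1, 1],
    "Day": [15, 15, 15],
    "Hour": [12, 8, 15],
})

# Parse using the mm/dd/yyyy hh:mm layout given in the column descriptions.
parsed = pd.to_datetime(sample["Timestamp"].str.strip(), format="%m/%d/%Y %H:%M")

# True only if every date part agrees with its dedicated column on every row.
timestamp_is_redundant = bool(
    (
        (parsed.dt.year == sample["Year"])
        & (parsed.dt.month == sample["Month"])
        & (parsed.dt.day == sample["Day"])
        & (parsed.dt.hour == sample["Hour"])
    ).all()
)
print(timestamp_is_redundant)
```

Running the same comparison on the full `df` (before the drop) would show whether any row disagrees.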

#### Checking missing/null values

In [8]:
df.isna().sum().sort_values(ascending = False)

ID                   0
Text                 0
Sentiment (Label)    0
Timestamp            0
User                 0
Source               0
Topic                0
Retweets             0
Likes                0
Country              0
Year                 0
Month                0
Day                  0
Hour                 0
dtype: int64

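### Based on the above, we can deduce that there are no null values in our data

Note: this check was executed (In [8]) before the columns were dropped (In [10]), which is why ID, Timestamp and User still appear in its output.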
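
`isna()` only detects missing values, so empty or whitespace-only strings would pass the check above unnoticed. The sketch below shows one way to count them as well; the `sample` frame is hypothetical and whether the real dataset contains such cells is not known from the outputs shown here.

```python
import pandas as pd

# Hypothetical rows: one whitespace-only Text and one empty Country.
sample = pd.DataFrame({
    "Text": ["Enjoying a beautiful day at the park!", "   ", "Traffic was terrible this morning."],
    "Country": ["USA", "Canada", ""],
})

# isna() reports zero here, because "" and "   " are valid (non-null) strings.
null_counts = sample.isna().sum()

# Count cells that are empty once surrounding whitespace is removed.
blank_counts = sample.apply(lambda col: col.str.strip().eq("").sum())

print(null_counts.to_dict())
print(blank_counts.to_dict())
```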

***************************************************************
#### Description of numeric data

In [14]:
df.describe()

| | Retweets | Likes | Year | Month | Day | Hour |
|---|---|---|---|---|---|---|
| count | 732.0 | 732.0 | 732.0 | 732.0 | 732.0 | 732.0 |
| mean | 21.508197 | 42.901639 | 2020.471311 | 6.122951 | 15.497268 | 15.521858 |
| std | 7.061286 | 14.089848 | 2.802285 | 3.411763 | 8.474553 | 4.113414 |
| min | 5.0 | 10.0 | 2010.0 | 1.0 | 1.0 | 0.0 |
| 25% | 17.75 | 34.75 | 2019.0 | 3.0 | 9.0 | 13.0 |
| 50% | 22.0 | 43.0 | 2021.0 | 6.0 | 15.0 | 16.0 |
| 75% | 25.0 | 50.0 | 2023.0 | 9.0 | 22.0 | 19.0 |
| max | 40.0 | 80.0 | 2023.0 | 12.0 | 31.0 | 23.0 |


### Checking some columns and noting issues in them

``Text Column:``

In [16]:
df['Text'].head()

0     Enjoying a beautiful day at the park!        ...
1     Traffic was terrible this morning.           ...
2     Just finished an amazing workout! 💪          ...
3     Excited about the upcoming weekend getaway!  ...
4     Trying out a new recipe for dinner tonight.  ...
Name: Text, dtype: object

#### The Text column doesn't contain any irrelevant data

``Sentiment (Label):``

In [21]:
df['Sentiment (Label)'].value_counts()

Sentiment (Label)
Positive           44
Joy                42
Excitement         32
Neutral            14
Contentment        14
                   ..
Adrenaline          1
Harmony             1
ArtisticBurst       1
Radiance            1
Elegance            1
Name: count, Length: 279, dtype: int64

#### Need to map each of the 279 distinct labels to one of three classes: positive, negative or neutral
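
One possible way to do this is a lookup dictionary. The sketch below is only an illustration: `LABEL_TO_CLASS` and `to_three_classes` are hypothetical names, the dictionary covers a handful of labels, and the class assigned to each label (for example Joy as positive) is an assumption that the team would need to decide for all 279 labels.

```python
import pandas as pd

# Hypothetical, partial mapping; the real project needs an entry for every label.
LABEL_TO_CLASS = {
    "Positive": "positive",
    "Joy": "positive",
    "Excitement": "positive",
    "Contentment": "positive",
    "Neutral": "neutral",
    "Negative": "negative",
}


def to_three_classes(labels: pd.Series) -> pd.Series:
    """Strip surrounding whitespace, then map each label; unmapped labels become NaN."""
    return labels.str.strip().map(LABEL_TO_CLASS)


sample = pd.Series(["Positive", " Joy ", "Neutral", "Negative", "ArtisticBurst"])
classes = to_three_classes(sample)

# Labels that still need a decision show up as NaN after mapping.
unmapped = sample[classes.isna()].str.strip().unique().tolist()

print(classes.tolist())
print(unmapped)
```

Listing the unmapped labels makes it easy to extend the dictionary step by step until nothing is left over.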

``Source:``

In [23]:
df['Source'].value_counts()

Source
Instagram     258
Facebook      231
Twitter       128
Twitter       115
Name: count, dtype: int64

#### Need to merge the two "Twitter" entries into one (Note: one of them is followed by a single trailing space and the other by two)
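
Since the two entries differ only in trailing whitespace, stripping it should be enough to merge them. A minimal sketch on a hypothetical `sample` series (the exact spacing of the Instagram and Facebook values is assumed):

```python
import pandas as pd

# Stand-in values: the two Twitter variants differ only by trailing spaces.
sample = pd.Series(["Twitter ", "Twitter  ", "Instagram ", "Facebook "], name="Source")

# str.strip() removes leading and trailing whitespace from every value.
cleaned = sample.str.strip()

print(sample.nunique(), cleaned.nunique())
print(cleaned.value_counts().to_dict())
```

On the real data this would be `df['Source'] = df['Source'].str.strip()`.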

``Topic:``

In [24]:
df['Topic'].value_counts()

Topic
#Compassionate #TearsOfEmpathy                  3
#Proud #ScalingPeaks                            3
#Hopeful #SeedsOfOptimism                       3
#Playful #CarnivalEscapade                      3
#Contentment #TranquilWaters                    2
                                               ..
#Acceptance #BeautifulChaos                     1
#Determination #ExtraordinaryPath               1
#Serenity #RaindropMelody                       1
#Curiosity #SeekerOfKnowledge                   1
#VirtualEntertainment #HighSchoolPositivity     1
Name: count, Length: 697, dtype: int64

In [25]:
df['Topic'].head()

0     #Nature #Park                            
1     #Traffic #Morning                        
2     #Fitness #Workout                        
3     #Travel #Adventure                       
4     #Cooking #Food                           
Name: Topic, dtype: object

#### The Topic column needs to be split into two. For example, the combination #Compassionate #TearsOfEmpathy occurred 3 times, but that doesn't mean that #Compassionate occurred only 3 times
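
A sketch of the split, assuming every Topic value holds exactly two hashtags separated by whitespace. The column names `Topic1` and `Topic2` and the `#Kindness` hashtag are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; "#Kindness" is invented to show a hashtag shared across combos.
sample = pd.DataFrame({"Topic": [
    " #Compassionate #TearsOfEmpathy ",
    " #Compassionate #Kindness ",
    " #Nature #Park ",
]})

# Strip padding, then split once on whitespace into two parts.
parts = sample["Topic"].str.strip().str.split(n=1, expand=True)
sample["Topic1"] = parts[0]
sample["Topic2"] = parts[1]

# Count each hashtag on its own, regardless of which slot it appeared in.
hashtag_counts = pd.concat([sample["Topic1"], sample["Topic2"]]).value_counts()

print(hashtag_counts.to_dict())
```

Here `#Compassionate` is counted twice even though each of its combinations occurs only once.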

``Retweets:``

In [27]:
df['Retweets'].value_counts().sort_index(ascending=False)

Retweets
40     16
35     41
30     55
28     50
27      2
26      5
25     75
24      8
23     12
22    106
21     10
20     67
19      9
18     93
17      5
16      9
15     74
14     15
13      5
12     29
11      1
10     21
9       3
8      12
7       7
5       2
Name: count, dtype: int64

``Likes:``

In [28]:
df['Likes'].value_counts().sort_index(ascending=False)

Likes
80    16
70    41
60    55
55    50
52     2
51     1
50    75
49     1
48     9
47     1
46     6
45    94
44    10
43     6
42     9
41     2
40    62
39     6
38     7
37     3
36    16
35    77
34     4
33     1
32     8
31     1
30    73
28    15
27     1
26     5
25    23
24     6
22     1
20    21
18     3
16     2
15    17
10     2
Name: count, dtype: int64

``Country:``

In [30]:
df['Country'].value_counts()

Country
USA               59
USA               55
UK                49
Canada            44
Australia         41
                  ..
Netherlands        1
USA                1
Germany            1
France             1
USA                1
Name: count, Length: 115, dtype: int64

#### Need to combine duplicated country keys into a single key (e.g. merge the multiple USA entries into one). Note: as with Source, the duplicates differ only in the number of trailing spaces
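
To see how much of the duplication is caused by padding, one can count the padded values first and then strip them. This is a rough sketch on a hypothetical `sample` frame; the spacing of the values is assumed.

```python
import pandas as pd

# Stand-in rows: three spellings of USA that differ only in trailing spaces.
sample = pd.DataFrame({
    "Country": ["USA ", "USA   ", "UK ", "USA"],
    "Likes": [30, 10, 15, 40],
})

# Number of values that change when whitespace is stripped.
padded = int((sample["Country"] != sample["Country"].str.strip()).sum())

distinct_before = sample["Country"].nunique()
sample["Country"] = sample["Country"].str.strip()
distinct_after = sample["Country"].nunique()

print(padded, distinct_before, distinct_after)
print(sample["Country"].value_counts().to_dict())
```

If the number of distinct countries in the real data is still suspiciously high after stripping, the remaining duplicates would have another cause, such as different spellings.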

``Year:``

In [31]:
df['Year'].value_counts().sort_index(ascending=False)

Year
2023    289
2022     63
2021     63
2020     69
2019     73
2018     56
2017     43
2016     38
2015     19
2014      4
2013      4
2012      4
2011      4
2010      3
Name: count, dtype: int64

``Month:``

In [32]:
df['Month'].value_counts().sort_index(ascending=False)

Month
12    39
11    49
10    48
9     77
8     78
7     62
6     71
5     46
4     51
3     44
2     85
1     82
Name: count, dtype: int64

``Day:``

In [33]:
df['Day'].value_counts().sort_index(ascending=False)

Day
31     5
30    23
29    11
28    59
27    12
26    10
25    23
24    11
23    10
22    39
21    10
20    39
19    14
18    49
17    17
16    11
15    73
14    13
13     7
12    38
11    11
10    63
9      5
8     34
7     11
6      7
5     48
4      5
3     21
2     27
1     26
Name: count, dtype: int64

``Hour:``

In [34]:
df['Hour'].value_counts().sort_index(ascending=False)

Hour
23     7
22    33
21    41
20    50
19    75
18    65
17    48
16    69
15    47
14    94
13    30
12    38
11    37
10    30
9     28
8     23
7      7
6      4
5      1
3      3
2      1
0      1
Name: count, dtype: int64

------

## Step 2: Data Cleaning and Preprocessing
Assigned to: David/Zedan

-----

## Step 3: Exploratory Data Analysis
Assigned to: Amany/Rehab

------

## Step 4: Premodelling Phase

-------

## Step 5: Modelling

-----

## Conclusion