In [None]:
import pandas as pd
import itertools
import spacy
import nltk
from nltk.corpus import stopwords
from collections import Counter
import matplotlib.pyplot as plt
from wordcloud import WordCloud


## Load Data

For now, let us work with posts. It would be nice if you could replicate this notebook with comments.

In [None]:
data=pd.read_csv('the-reddit-climate-change-dataset-posts.csv')

For now, let us only consider the following variables:

* subreddit.name
* created_utc
* permalink
* domain
* url
* selftext
* title
* score


**Task**: 

Based on the description of the dataset provided on Kaggle, add to this notebook the meaning of each of the variables above.
For instance, `score` is the number of votes given to the post. Also answer: why do you think we had no use for the rest of the variables?

In [None]:
selected_variables=["subreddit.name", "created_utc", "permalink", "domain", "url", "selftext", "title", "score"]

In [None]:
data=data[selected_variables].copy()

**Task**: 


Could you add a conclusion based on `data.info()` below?

In [None]:
data.info()

Here is an example of an entry in our dataset:

In [None]:
data.iloc[0]

Example of a post title:

In [None]:
data["title"].iloc[0]

Since we are interested in user behaviour over a certain period of time, we sometimes need to create more informative variables or to transform the ones we already have. In ML this is often referred to as *feature engineering*.

For example, it isn't straightforward to identify the date from `created_utc` as it is, so we need a transformation that gives a more interpretable representation. In our case that is `pd.to_datetime(data['created_utc'], unit='s')` (*look it up*).

In [None]:
data['created_utc'] = pd.to_datetime(data['created_utc'], unit='s')
data['year'] = data['created_utc'].dt.year
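
To see what `unit='s'` does, here is a minimal sketch on two hand-picked timestamps (illustrative values, not taken from the dataset). It assumes, as the cell above does, that `created_utc` holds Unix epoch seconds, i.e. seconds elapsed since 1970-01-01 00:00:00 UTC.

```python
import pandas as pd

# Two illustrative Unix timestamps (seconds since 1970-01-01 00:00:00 UTC).
example = pd.Series([1262304000, 1356998400])

# unit='s' tells pandas to read the integers as seconds
# (without it, integers are interpreted as nanoseconds).
converted = pd.to_datetime(example, unit='s')

# The .dt accessor exposes date parts such as the year.
years = converted.dt.year.tolist()
print(converted.tolist())
print(years)
```

Once the column is a datetime, other parts such as `.dt.month` or `.dt.dayofweek` are available in the same way.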

In [None]:
data['year'].unique()

Notice we have posts from 2010 to 2022. For fun, let us split the data into two periods, up to and including 2012 and after 2012, because of the rumor that the world would end in that year.

In [None]:
data_bf=data[data['year']<=2012].copy()
data_af=data[data['year']>2012].copy()

****

# Exploratory Data Analysis

The function below returns the frequency of each category.  

In [None]:
def count_categories(categories):
    category_counts = {}
    for category in categories:
        if category in category_counts:
            category_counts[category] += 1
        else:
            category_counts[category] = 1

    return list(category_counts.items())
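
The same tally can be obtained with `collections.Counter`, which is already imported at the top of the notebook. The sketch below uses a made-up list of subreddit names; `toy_categories` is purely illustrative.

```python
from collections import Counter

# Made-up subreddit names, only for illustration.
toy_categories = ["climate", "science", "climate", "worldnews", "climate", "science"]

# Counter tallies each distinct value in a single pass.
toy_counts = Counter(toy_categories)

# list(toy_counts.items()) has the same (category, count) shape
# as the list returned by count_categories.
toy_pairs = list(toy_counts.items())
print(toy_pairs)

# most_common(n) returns the n most frequent categories, sorted by count.
top_two = toy_counts.most_common(2)
print(top_two)
```

On a pandas column, `data['subreddit.name'].value_counts()` gives the same counts as a Series sorted by frequency.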

In [None]:
categories_count = count_categories(data['subreddit.name']) 
print(categories_count)

Let's now see if the most frequent categories changed after 2012.

In [None]:
categories_count = count_categories(data_bf['subreddit.name']) 
filtered_categories = list(itertools.filterfalse(lambda x: x[1] <= 100, categories_count))
word_freq_dict = dict(filtered_categories)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq_dict)
plt.figure(figsize=(10, 5))
plt.title("Most frequent categories before 2012")
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  
plt.show()

In [None]:
categories_count = count_categories(data_af['subreddit.name']) 
filtered_categories = list(itertools.filterfalse(lambda x: x[1] <= 5000, categories_count))
word_freq_dict = dict(filtered_categories)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(word_freq_dict)
plt.figure(figsize=(10, 5))
plt.title("Most frequent categories after 2012")
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')  
plt.show()

**Task**: 

Add a conclusion below describing the differences (if any) between the most frequent categories before and after 2012. Is the code the same? What changed?

## Finding Entities  

In [None]:
# For now, let's only use complete data (rows where 'selftext' is not missing).
data = data.dropna(subset=['selftext'])

In [None]:
data=data.sample(n=1000).reset_index(drop=True)

In [None]:
nltk.download('stopwords')

In [None]:
# Load Spacy model and stopwords
nlp = spacy.load("en_core_web_sm")
stop_words = set(stopwords.words('english'))  # Set for faster lookup

In [None]:
# Remove stopwords from the text
data['selftext'] = data['selftext'].apply(lambda x: ' '.join(word for word in x.split() if word.lower() not in stop_words))
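
NLTK's English stop-word list is lowercase, so how case is handled matters when filtering. Here is a small sketch of the filtering step using a tiny hand-written stop-word set (`toy_stop_words` and `remove_stopwords` are hypothetical names introduced here, not part of the notebook):

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
toy_stop_words = {"the", "is", "in", "a", "of"}

def remove_stopwords(text, stop_words):
    """Drop whitespace-separated tokens whose lowercase form is a stop word."""
    return ' '.join(word for word in text.split() if word.lower() not in stop_words)

cleaned = remove_stopwords("The climate in Brazil is changing", toy_stop_words)
print(cleaned)
```

Punctuation attached to a word (for example `the,`) is not stripped by `split()`, so such tokens pass through unchanged.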

In [None]:
# Function to extract locations (GPE entities) from text
def extract_locations(text):
    return [ent.text for ent in nlp(text).ents if ent.label_ == 'GPE']

In [None]:
# Extract locations from the 'selftext' column and flatten them into one list
locations = [loc for sublist in data['selftext'].apply(extract_locations) for loc in sublist]

In [None]:
# Get top 50 most common locations
most_common_50 = Counter(locations).most_common(50)

In [None]:
# Prepare data for plotting
all_locs = [loc for loc, _ in most_common_50]
# Scale the counts so the bubbles are large enough to see
num_loc_mentions = [count * 50 for _, count in most_common_50]
# Average score of the posts whose text contains each location
avg_loc_scores = [data[data['selftext'].str.contains(loc, regex=False)]['score'].mean() for loc in all_locs]

In [None]:
# Plotting the data
plt.figure(figsize=(20, 6))
plt.scatter(all_locs, avg_loc_scores, s=num_loc_mentions, alpha=0.5)
# Add titles and labels
plt.title("Average Score For Top 50 Mentioned Locations and Number of Mentions in a Random Sample of 1000 Rows")
plt.xlabel("Top 50 Mentioned Locations")
plt.xticks(rotation=90)
plt.ylabel("Average Score")
plt.legend(["Bubble Size = Number of Mentions"])
plt.show()


**Task**: 

Do you think this plot is okay as it is? Should we filter it more? How? Try to refine the above plot using your own logic (have fun with it).
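
One possible refinement, offered only as a starting point: keep a location only if it appears in at least some minimum number of posts, so that averages based on one or two posts do not dominate the plot. The sketch below uses a made-up DataFrame; `toy_posts`, `toy_locations`, and `min_posts` are illustrative assumptions.

```python
import pandas as pd

# Made-up posts, only for illustration.
toy_posts = pd.DataFrame({
    "selftext": ["Floods in India", "India and China talks", "Heat in China",
                 "India drought", "Snow in Peru"],
    "score": [10, 20, 30, 60, 500],
})
toy_locations = ["India", "China", "Peru"]
min_posts = 2  # assumed threshold; tune it to the data

rows = []
for loc in toy_locations:
    # Posts whose text contains the location as a plain substring.
    mask = toy_posts["selftext"].str.contains(loc, regex=False)
    rows.append({"location": loc,
                 "n_posts": int(mask.sum()),
                 "avg_score": toy_posts.loc[mask, "score"].mean()})

summary = pd.DataFrame(rows)

# Keep only locations supported by enough posts.
filtered = summary[summary["n_posts"] >= min_posts].reset_index(drop=True)
print(filtered)
```

Here the single high-scoring post about Peru is dropped because it is the only post mentioning that location.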

In [None]:
# Function to extract person names (PERSON entities) from text
def extract_names(text):
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

In [None]:
persons = [person for sublist in data['selftext'].apply(extract_names) for person in sublist]
most_common_50 = Counter(persons).most_common(50)

# Prepare data for plotting
all_persons = [person for person, _ in most_common_50]
# Scale the counts so the bubbles are large enough to see
num_person_mentions = [count * 50 for _, count in most_common_50]
# Average score of the posts whose text contains each person's name
avg_person_scores = [data[data['selftext'].str.contains(person, regex=False)]['score'].mean() for person in all_persons]

In [None]:
# Plotting the data
plt.figure(figsize=(20, 6))
plt.scatter(all_persons, avg_person_scores, s=num_person_mentions, alpha=0.5)
# Add titles and labels
plt.title("Average Score For Top 50 Mentioned Persons and Number of Mentions in a Random Sample of 1000 Rows")
plt.xlabel("Top 50 Mentioned Persons")
plt.xticks(rotation=90)
plt.ylabel("Average Score")
plt.legend(["Bubble Size = Number of Mentions"])
plt.show()

**Task**: 

Add a conclusion

# Extra: 

This is a preview of our future work. Keep these questions in mind, and try to find other interesting questions or suggest ideas for answering the ones below.

We strongly suggest using ChatGPT.

1. What is the number of posts per year?
2. What is the average score (number of votes) per year per category?
3. We want to measure the popularity and influence of the persons we found. How can we measure popularity and/or influence?
4. How are people affected by climate change, and how do they cope with it? For example, who engages in these kinds of conversations? Can we say something about the age or context of the authors of these posts? (This is also called author profiling.)
5. How can we identify false information or fake news with this data?
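
For questions 1 and 2, a `groupby` is one natural starting point. The sketch below runs on a made-up DataFrame that mimics the `year`, `subreddit.name`, and `score` columns used above; all values are invented.

```python
import pandas as pd

# Made-up data with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "year": [2011, 2011, 2012, 2012, 2012],
    "subreddit.name": ["climate", "science", "climate", "climate", "science"],
    "score": [4, 10, 6, 8, 2],
})

# Question 1: number of posts per year.
posts_per_year = toy.groupby("year").size()

# Question 2: average score per year and per category.
avg_score = toy.groupby(["year", "subreddit.name"])["score"].mean()

print(posts_per_year)
print(avg_score)
```

Replacing `toy` with `data` should work as long as the `year` column has been created as shown earlier in the notebook.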