# Facebook Post Data - Crash Course

In this notebook, you are going to analyse your own Facebook posts. We will explore your data, pre-process the text and finally find out, for example, which words earned you the most likes or, given some text, how many reactions you could expect. The results will be personal to you, which makes them all the more interesting.


So, if you have not done so yet, go to https://github.com/LuxembourgTechSchool/FacebookPostsToCsv and follow the steps. At the end you should have 2 CSV files named `posts_default.csv` and `reactions_default.csv`.

## 1. Importing Libraries

As always, we start by importing a few basic libraries.

In [None]:
# Import pandas and give it a shorter name "pd"


# Import numpy and give it a shorter name "np"


# Import pyplot from matplotlib and give it a shorter name "plt".


# Tell Jupyter to display plots in the cell results (specific to notebooks only)
%matplotlib inline 

## 2. Loading the Data

Next, we load our 2 CSV files.

**Important:** Make sure that you set the right path to point to the posts and reactions CSV files.

In [None]:
posts = pd.read_csv('../data/facebook_data/posts_default.csv')
reactions = pd.read_csv('../data/facebook_data/reactions_default.csv')

Let's take a look inside.

In [None]:
# Print the first 10 posts


In [None]:
print('Size of posts: ', ...)

In [None]:
# Print the first few reactions


In [None]:
print('Size of reactions: ', ...)

## 2.1 Column Descriptions

Starting with Posts, we have the following columns:

- **id:** The unique id of the post.
- **created_time:** The date/time when the post was published.
- **message:** The message of the post.
- **story:** The story of the post (e.g. "John Doe shared a link."). This is different from the message.

Then for the Reactions we have:

- **id:** The unique id of the reaction.
- **name:** The name of the person that reacted to your post.
- **type:** The type of the reaction (LIKE, LOVE, ANGRY, HAHA,...)
- **post:** The id of the post that this reaction is linked to.

Note that some posts might have no reactions at all. We will also find out which ones.
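As a rough sketch of how posts without reactions could be found, the example below uses two small made-up DataFrames (`posts_demo` and `reactions_demo` are hypothetical names, not your real data) that carry the columns described above. It assumes that a post has no reactions exactly when its `id` never appears in the `post` column of the reactions.

```python
import pandas as pd

# Made-up data with the same column names as the two CSV files.
posts_demo = pd.DataFrame({
    'id': ['p1', 'p2', 'p3'],
    'message': ['Hello world', None, 'Happy New Year'],
    'story': [None, 'John Doe shared a link.', None],
})
reactions_demo = pd.DataFrame({
    'id': ['r1', 'r2', 'r3'],
    'name': ['Alice', 'Bob', 'Alice'],
    'type': ['LIKE', 'LOVE', 'LIKE'],
    'post': ['p1', 'p1', 'p3'],
})

# isin() returns True for every post id that appears in the reactions.
has_reaction = posts_demo['id'].isin(reactions_demo['post'])

# Negating the mask keeps only the posts that nobody reacted to.
posts_without_reactions = posts_demo[~has_reaction]
print(posts_without_reactions['id'].tolist())
```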

# 3. Explore the Data

All right, now let's explore the data a little, extract some information and perform some computations.

## 3.1. How many LIKES, LOVES, HAHA, ANGRY, ...

Let's start simple by counting and exploring how many types of reaction we have.
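Two ways of counting are used in this section: `np.where` for a single reaction type and `value_counts()` for all types at once. The sketch below shows both on a made-up Series (`types_demo` is a hypothetical stand-in for the `type` column), so you can see what each call returns before applying it to your own data.

```python
import numpy as np
import pandas as pd

# Made-up reaction types standing in for reactions['type'].
types_demo = pd.Series(['LIKE', 'LOVE', 'LIKE', 'HAHA', 'LIKE'])

# With a single condition, np.where returns a tuple holding one array
# with the positions where the condition is True.
like_positions = np.where(types_demo == 'LIKE')
n_like = len(like_positions[0])

# value_counts() counts every distinct value in one call.
counts = types_demo.value_counts()

print(n_like)
print(counts)
```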

In [None]:
reactions[:10]

In [None]:
# Here we use numpy's where() function, which returns the positions of all reactions whose type equals "LIKE"


# np.where() returns a tuple containing one array, so we check the length of the array at index 0
len(reaction_like[0])

In [None]:
# now lets see how many of the reactions are "LOVE"


# as "reaction_love" is an array inside a tuple, we check the length of the array at index 0
len(reaction_love[0])

In [None]:
# now lets see how many of the reactions are "HAHA"


# as "reaction_haha" is an array inside a tuple, we check the length of the array at index 0
len(reaction_haha[0])

In [None]:
# now lets see how many of the reactions are "ANGRY"


# as "reaction_angry" is an array inside a tuple, we check the length of the array at index 0
len(reaction_angry[0])

In [None]:
# Print the value_counts of reaction types


All right, the numbers are interesting. A plot would be nice too:

In [None]:
# Define variable data with the reaction value counts of types


# Plot the data as pie


# Set the label on the x-axis


# Set the label on the y-axis


# Set the title and font size


# Show the grid


## 3.2. Who is your Biggest Fan?

Let's find out:

- Which friend reacted the most to your posts?
- Which friend likes your posts the most?
- Which friend loves your posts the most?

#### 3.2.1. Which friend reacted the most to your posts?

In [None]:
# Easy, let's get the count of rows by name first.
reactions_by_name = ...
reactions_by_name[:10]

In [None]:
# Print the result
if ...:
    max_value = ...
    max_name = ...
    print('{} reacted the most to your posts with {} reactions.'.format(max_name, max_value))
else:
    print('No one reacted to your posts.')

#### 3.2.2. Which friend likes your posts the most?

In [None]:
likes_by_name = ...
likes_by_name[:10]

In [None]:
# Print the result
if ...:
    max_value = ...
    max_name = ...
    print('{} likes your posts {} times.'.format(max_name, max_value))
else:
    print('No one likes your posts.')

#### 3.2.3. Which friend loves your posts the most?

In [None]:
loves_by_name = ...
loves_by_name[:10]

In [None]:
# Print the result
if ...:
    max_value = ...
    max_name = ...
    print('{} loves your posts {} times.'.format(max_name, max_value))
else:
    print('No one loves your posts.')

Great! If you want to find out who reacted the most with ANGRY, HAHA or SAD reactions, you can write the code now or even prepare a function.
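One possible shape for such a function is sketched below. The name `top_reactor` and the demo data are made up for illustration; the sketch assumes the reactions DataFrame has the `name` and `type` columns described earlier, and it returns `None` when nobody used the requested reaction.

```python
import pandas as pd

def top_reactor(reactions_df, reaction_type=None):
    """Return (name, count) for the person who reacted most, or None."""
    # Optionally keep only one reaction type (e.g. 'HAHA').
    if reaction_type is not None:
        reactions_df = reactions_df[reactions_df['type'] == reaction_type]

    # Count the rows per name.
    counts = reactions_df['name'].value_counts()
    if len(counts) == 0:
        return None

    # idxmax() gives the name with the highest count.
    return counts.idxmax(), int(counts.max())

# Made-up data for a quick check.
reactions_demo = pd.DataFrame({
    'name': ['Alice', 'Alice', 'Bob', 'Bob', 'Bob'],
    'type': ['LIKE', 'LIKE', 'LIKE', 'HAHA', 'HAHA'],
})

print(top_reactor(reactions_demo))
print(top_reactor(reactions_demo, 'LIKE'))
print(top_reactor(reactions_demo, 'ANGRY'))
```

If two people are tied, this sketch simply reports whichever name `idxmax()` finds first.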

In [None]:
# Your turn, try a few things out :)



## 3.3. Did you post things multiple times?

Let's see whether you reposted any of your messages. Note that such a repost is not a duplicate that you need to remove; it is just a message that you posted 2+ times over the years.
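A minimal sketch of one way to spot repeated messages with `duplicated()` follows. The data is made up, and dropping missing messages first is an assumption of this sketch: otherwise all `NaN` messages would be treated as copies of each other.

```python
import pandas as pd

# Made-up posts; two of them share the same message, two have no message.
posts_demo = pd.DataFrame({
    'id': ['p1', 'p2', 'p3', 'p4', 'p5'],
    'message': ['Happy New Year', 'Hello', 'Happy New Year', None, None],
})

# Ignore posts without a message so that NaN values are not counted as reposts.
messages = posts_demo['message'].dropna()

# By default duplicated() marks the second and later occurrences only.
n_reposts = int(messages.duplicated().sum())

# keep=False marks every occurrence, which is handy for looking at them.
repeated_index = messages[messages.duplicated(keep=False)].index
repeated = posts_demo.loc[repeated_index]

print(n_reposts)
print(repeated)
```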

In [None]:
print('Number of reposts: {}'.format( ... ))

In [None]:
# Let's see a few of those messages:

query = ...
posts[ query ][:10]

# 4. Pre-Processing

## 4.1. Duplicates

What are duplicates here?

For posts, it would be 2 posts at the same time, with the same message and same ID. From how we get the data, we know it is not really possible. 

Then, removing all posts where the message or story is the same is not a good idea, because, let's say you posted "Happy New Year" every year! That's the same message string but on different dates. So for us, we are going to keep everything.

For scientific reasons, we are still going to print the number of duplicates:

In [None]:
# How many 'full' duplicated rows in posts?


In [None]:
# How many 'full' duplicated rows in reactions?


## 4.2. Missing Values

The same is valid here. Because of the method (Facebook API + Script) that we have used to get the data, we know there are no missing values that would prevent us from processing a row.

The only detail to remember is that `message` or `story` can be `NaN` (the missing-value marker in pandas/numpy). But this is not a big deal, as you will see later.

## 4.3. Format Dates

Each post has a datetime column. Since it is a string we are going to convert it into a real Python datetime object and then we are going to extract a few date components.

In [None]:
# First, check the format


Pandas has a function for working with date/time data. It is explained [here](http://pandas.pydata.org/pandas-docs/version/0.19.1/generated/pandas.to_datetime.html).

It allows us to convert a string to a real `datetime` object representing the date and time. All we have to do is call the method and give it the format of the date/time.
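As an illustration, the sketch below converts a made-up Series of strings with `pd.to_datetime`. The format string is the one used in the next cell; whether your own `created_time` values match it is an assumption you should check against the output of the format check above.

```python
import pandas as pd

# Made-up date strings that follow the format used in this notebook.
raw_dates = pd.Series(['2017/01/01T10:00:00', '2017/12/31T23:59:59'])

# The format tells pandas how to read each part of the string.
parsed_dates = pd.to_datetime(raw_dates, format='%Y/%m/%dT%H:%M:%S')

# The dtype is now a datetime type instead of object (string).
print(parsed_dates.dtype)
print(parsed_dates)
```

If the strings do not match the format, `pd.to_datetime` raises an error, which is a useful early warning that the format needs adjusting.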

In [None]:
# Year/Month/Day, the letter T, then Hours:Minutes:Seconds
datetime_format = '%Y/%m/%dT%H:%M:%S'

We simply overwrite the `created_time` column with the converted data:

In [None]:
# Check again. Note that the dtype has changed.


We can now get the desired components and store them in new columns. To create a new column, you can just define it the same way as you would define a key in a dictionary.

This is the power of pandas at work. We don't need to loop. Pandas handles everything for us and fills in every row of our new columns.
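A small sketch of this idea, using the `.dt` accessor on a made-up DataFrame, is shown here. The column names `HourOfDay` and `DayOfWeekNum` are the ones used later in this notebook; `DayOfWeek` is an extra name assumed for illustration.

```python
import pandas as pd

# Made-up posts: a Monday morning and a Saturday night.
dates_demo = pd.DataFrame({
    'created_time': pd.to_datetime(['2017-01-02 10:15:00', '2017-01-07 23:05:00']),
})

# Each assignment fills the new column for every row at once, without a loop.
dates_demo['HourOfDay'] = dates_demo['created_time'].dt.hour
dates_demo['DayOfWeekNum'] = dates_demo['created_time'].dt.dayofweek  # Monday=0 ... Sunday=6
dates_demo['DayOfWeek'] = dates_demo['created_time'].dt.day_name()

print(dates_demo)
```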

In [None]:
# Check what we now have in our DataFrame


Great, now we are ready to plot some interesting things.

# 5. Advanced Exploration

Let's go ahead and plot the posts per hour and find out what are the times you post the most things.

Then, we are going to see what day of the week you are the most active too.

Finally, we are going to find out which words can give you the most reactions!

## 5.1. Posts per Hour

To create the plot, we are going to learn about pivot tables. We could achieve something similar with `value_counts()`, but in this case a pivot table is cleaner. You can read a really nice introduction [here](http://pbpython.com/pandas-pivot-table-explained.html).
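To give a feel for what such a pivot table looks like, here is a sketch on made-up data (`hours_demo` and `post_hour_demo` are hypothetical names). It groups the rows by `HourOfDay` and counts the messages in each group.

```python
import pandas as pd

# Made-up posts with the hour at which each one was published.
hours_demo = pd.DataFrame({
    'HourOfDay': [9, 9, 14, 20, 20, 20],
    'message': ['a', 'b', 'c', 'd', 'e', 'f'],
})

# One row per unique hour; the aggregate function counts the messages.
post_hour_demo = pd.pivot_table(
    hours_demo, index='HourOfDay', values='message', aggfunc='count'
)

print(post_hour_demo)
```

Note that `count` skips missing values, so posts whose message is `NaN` would not be counted by this sketch.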

In [None]:
# Create a pivot table, taking all unique HourOfDay values, and counting the number of messages.
# The aggregate function is count.


# See the result.
print(post_hour)

In [None]:
# Let's plot the things



# Set the label on the x-axis


# Set the label on the y-axis


# Set the title and font size


# Show the grid


## 5.2 Posts by Day of Week

We do the same for the day of the week. The steps are the same: first we make the pivot table, then we plot.

In [None]:
# 1. Create the pivot table.


# Note that this time, it is ordered by DayOfWeekNum, which is what we prefer.
print(post_weekdays)

In [None]:
# Let's plot the data as a bar chart. Give the plot a width of 10 and height of 6. Make the bars 50% transparent.


# Set the label on the x-axis


# Set the label on the y-axis


# Set the title and font size


# Show the grid


## 5.3. Words That Give The Most Reactions

Now, we are going to find out, and maybe predict, which words you should use in your posts to generate the most reactions. For this we need to do a few things:

1. Posts and Reactions are in 2 separate files / databases, so we need to merge them in some way.
2. Message or Story are long texts, so we need to process the text and, for example, remove all special characters and split by words.

In other words, the goal is to have, for each row, the list of words and the number of reactions. Then we could, for example, compute a ratio of reactions per word.

### 5.3.1. Add Number of Reactions to Posts Data

Let's start by counting the number of reactions for each post and storing that value in new columns. We want to build something like this:

|id|message|n_reactions|n_like|n_haha|...|
|----|----|----|----|----|----|
|1|M1|10|5|5|...|
|2|M2|4|0|2|...|
|3|M3|30|29|1|...|
|4|M4|120|67|29|...|
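A table of this shape can be sketched with `pd.pivot_table` on made-up reactions, as below. Using `margins=True` to obtain an `All` total and `fill_value=0` for missing combinations are assumptions of this sketch, and `reaction_pt_demo` is a hypothetical name.

```python
import pandas as pd

# Made-up reactions: three on post p1, one on post p2.
reactions_demo = pd.DataFrame({
    'id': ['r1', 'r2', 'r3', 'r4'],
    'name': ['Alice', 'Bob', 'Alice', 'Carol'],
    'type': ['LIKE', 'HAHA', 'LIKE', 'LIKE'],
    'post': ['p1', 'p1', 'p2', 'p1'],
})

# Rows are posts, columns are reaction types, cells are counts.
# margins=True adds an 'All' row and column holding the totals.
reaction_pt_demo = pd.pivot_table(
    reactions_demo,
    index='post',
    columns='type',
    values='id',
    aggfunc='count',
    fill_value=0,
    margins=True,
)

print(reaction_pt_demo)
```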

In [None]:
# First, we can create a pivot table where we index the post and type and use count as aggregate function


In [None]:
# The result is exactly what we want


All right, we are nearly done with this pre-processing. Now we are going to be a little creative.

We will create the new columns `n_reactions`, `n_like`, `n_haha` and `n_angry` by using the `apply()` function.
So, we are going to define a function that gets the reaction count from the `reaction_pt` pivot table based on the post_id. If the post has no reactions, meaning it is not in the table, we just use 0.

**Important:** This is very little code but it does a lot, so take the time to really understand every part.
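The lookup-with-fallback idea can be sketched as follows. The table here is a hand-built stand-in for the pivot table, and `get_reaction_demo` is a hypothetical helper, not necessarily the function you will write; it also returns 0 when a reaction type has no column at all, which is an extra assumption.

```python
import pandas as pd

# Hand-built stand-in for the pivot table: one row per post, one column per type.
reaction_pt_demo = pd.DataFrame(
    {'HAHA': [1, 0], 'LIKE': [2, 1], 'All': [3, 1]},
    index=pd.Index(['p1', 'p2'], name='post'),
)

# Post p3 has no reactions, so it does not appear in the table.
posts_demo = pd.DataFrame({'id': ['p1', 'p2', 'p3']})

def get_reaction_demo(post_id, reaction_type, table=reaction_pt_demo):
    """Look up a count; posts or types missing from the table count as 0."""
    if post_id in table.index and reaction_type in table.columns:
        return int(table.loc[post_id, reaction_type])
    return 0

# apply() calls the function once per post id and collects the results.
posts_demo['n_reactions'] = posts_demo['id'].apply(lambda x: get_reaction_demo(x, 'All'))
posts_demo['n_like'] = posts_demo['id'].apply(lambda x: get_reaction_demo(x, 'LIKE'))
posts_demo['n_angry'] = posts_demo['id'].apply(lambda x: get_reaction_demo(x, 'ANGRY'))

print(posts_demo)
```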

In [None]:
# Example: This is how you get the item by the index.
print('Item at index 0:',  ... )
print('Reaction LIKE for item at index 0:',  ... )

In [None]:
# We define our function:


In [None]:
# We call our function in the apply using a lambda expression, x is the post_id

print('Processing n_reactions...')
posts['n_reactions'] = posts['id'].apply( lambda x : get_reaction(x, 'All') )

print('Processing n_like...')
posts['n_like'] = posts['id'].apply( lambda x : get_reaction(x, 'LIKE') )

print('Processing n_love...')
posts['n_love'] = posts['id'].apply( lambda x : get_reaction(x, 'LOVE') )

print('Processing n_haha...')
posts['n_haha'] = posts['id'].apply( lambda x : get_reaction(x, 'HAHA') )

print('Processing n_sad...')
posts['n_sad'] = posts['id'].apply( lambda x : get_reaction(x, 'SAD') )

print('Processing n_wow...') # TODO: WOW
posts['n_wow'] = ...

print('Processing n_angry...') # TODO: ANGRY
posts['n_angry'] = ...

print('All Done.')

In [None]:
# Print the first 10 posts
posts[:10]

Great! Ready for the next step.

### 5.3.2. Text-Processing on Message Column

All right, to be able to work with the message column, which is just text, we must process it. In simple Big Data analytics, this often means that we start by splitting the text into a list of words, getting rid of everything we don't need such as special symbols and stop words.

Why? So that we are able to map specific words to, for example, the number of likes. 

Does talking about your dog or cat give you more likes than complaining about Mondays? 

Then, based on a new message, how many likes could you expect? This is the kind of analysis that we can do, once the text data is transformed.

I recommend that you also read [this tutorial](https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words). It's a beginner introduction to text analysis techniques. We are going to use certain aspects explained there.

In [None]:
# Looking at messages


From the exploration we know that the message is sometimes `NaN` when the story is not. We are going to replace all `NaN` values with an empty string `""`. We will ignore the story for now; feel free to work on that part yourself.

In [None]:
# Replace all NaN with empty string ""
# fillna() returns a DataFrame in which all NaN values have been replaced with the given value


Now, we are going to demonstrate each step of the process and then use a function that performs each step for every row.

**Steps:**

1. Remove certain special symbols.
1. Lowercase everything.
1. Split by words.
1. Remove English stop words (a, an, is, of, ...)
1. Save the result as a simplified paragraph "word1 word2 word3 ..."
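The steps above can be sketched on a single made-up sentence as follows. The helper name `simplify_text`, the shortened symbol pattern and the tiny stop word set are all assumptions chosen to keep the example small; the notebook itself uses a longer symbol list and full stop word lists.

```python
import re

# Tiny made-up stop word set, only for this illustration.
demo_stopwords = {'a', 'is', 'my', 'the', 'of'}

def simplify_text(raw_text, stopwords=demo_stopwords):
    # 1. Replace a few special symbols with a space.
    text = re.sub(r"[.,!?'\"\-_]", " ", raw_text)
    # 2. Lowercase everything.
    text = text.lower()
    # 3. Split by whitespace into a list of words.
    words = text.split()
    # 4. Drop the stop words.
    kept = [word for word in words if word not in stopwords]
    # 5. Join the remaining words into one simplified string.
    return " ".join(kept)

print(simplify_text("My dog is the BEST, isn't he?!"))
```

Because the apostrophe is replaced by a space, "isn't" becomes the two tokens "isn" and "t", which is why fragments such as 't' and 'isn' appear in the English stop word list.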

Let's get to work:

In [None]:
# We will just use the first message:


In [None]:
# 1. Remove certain special symbols and replace by a space.
import re

symbols = r"[.,\[\]{}|`~'\"*&^%$@!?+>\-_]"

result = ...

result

In [None]:
# 2. Lowercase everything.
result = ...

result

In [None]:
# 3. Split by words
result = ...

result

In [None]:
# English, German and French stopwords.
english_stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']
german_stopwords = ['aber', 'alle', 'allem', 'allen', 'aller', 'alles', 'als', 'also', 'am', 'an', 'ander', 'andere', 'anderem', 'anderen', 'anderer', 'anderes', 'anderm', 'andern', 'anderr', 'anders', 'auch', 'auf', 'aus', 'bei', 'bin', 'bis', 'bist', 'da', 'damit', 'dann', 'der', 'den', 'des', 'dem', 'die', 'das', 'daß', 'derselbe', 'derselben', 'denselben', 'desselben', 'demselben', 'dieselbe', 'dieselben', 'dasselbe', 'dazu', 'dein', 'deine', 'deinem', 'deinen', 'deiner', 'deines', 'denn', 'derer', 'dessen', 'dich', 'dir', 'du', 'dies', 'diese', 'diesem', 'diesen', 'dieser', 'dieses', 'doch', 'dort', 'durch', 'ein', 'eine', 'einem', 'einen', 'einer', 'eines', 'einig', 'einige', 'einigem', 'einigen', 'einiger', 'einiges', 'einmal', 'er', 'ihn', 'ihm', 'es', 'etwas', 'euer', 'eure', 'eurem', 'euren', 'eurer', 'eures', 'für', 'gegen', 'gewesen', 'hab', 'habe', 'haben', 'hat', 'hatte', 'hatten', 'hier', 'hin', 'hinter', 'ich', 'mich', 'mir', 'ihr', 'ihre', 'ihrem', 'ihren', 'ihrer', 'ihres', 'euch', 'im', 'in', 'indem', 'ins', 'ist', 'jede', 'jedem', 'jeden', 'jeder', 'jedes', 'jene', 'jenem', 'jenen', 'jener', 'jenes', 'jetzt', 'kann', 'kein', 'keine', 'keinem', 'keinen', 'keiner', 'keines', 'können', 'könnte', 'machen', 'man', 'manche', 'manchem', 'manchen', 'mancher', 'manches', 'mein', 'meine', 'meinem', 'meinen', 'meiner', 'meines', 'mit', 'muss', 'musste', 'nach', 'nicht', 'nichts', 'noch', 'nun', 'nur', 'ob', 'oder', 'ohne', 'sehr', 'sein', 'seine', 'seinem', 'seinen', 'seiner', 'seines', 'selbst', 'sich', 'sie', 'ihnen', 'sind', 'so', 'solche', 'solchem', 'solchen', 'solcher', 'solches', 'soll', 'sollte', 'sondern', 'sonst', 'über', 'um', 'und', 'uns', 'unsere', 'unserem', 'unseren', 'unser', 'unseres', 'unter', 'viel', 'vom', 'von', 'vor', 'während', 'war', 'waren', 'warst', 'was', 'weg', 'weil', 'weiter', 'welche', 'welchem', 'welchen', 'welcher', 'welches', 'wenn', 'werde', 'werden', 'wie', 'wieder', 'will', 'wir', 'wird', 'wirst', 'wo', 'wollen', 
'wollte', 'würde', 'würden', 'zu', 'zum', 'zur', 'zwar', 'zwischen']
french_stopwords = ['au', 'aux', 'avec', 'ce', 'ces', 'dans', 'de', 'des', 'du', 'elle', 'en', 'et', 'eux', 'il', 'je', 'la', 'le', 'leur', 'lui', 'ma', 'mais', 'me', 'même', 'mes', 'moi', 'mon', 'ne', 'nos', 'notre', 'nous', 'on', 'ou', 'par', 'pas', 'pour', 'qu', 'que', 'qui', 'sa', 'se', 'ses', 'son', 'sur', 'ta', 'te', 'tes', 'toi', 'ton', 'tu', 'un', 'une', 'vos', 'votre', 'vous', 'c', 'd', 'j', 'l', 'à', 'm', 'n', 's', 't', 'y', 'été', 'étée', 'étées', 'étés', 'étant', 'étante', 'étants', 'étantes', 'suis', 'es', 'est', 'sommes', 'êtes', 'sont', 'serai', 'seras', 'sera', 'serons', 'serez', 'seront', 'serais', 'serait', 'serions', 'seriez', 'seraient', 'étais', 'était', 'étions', 'étiez', 'étaient', 'fus', 'fut', 'fûmes', 'fûtes', 'furent', 'sois', 'soit', 'soyons', 'soyez', 'soient', 'fusse', 'fusses', 'fût', 'fussions', 'fussiez', 'fussent', 'ayant', 'ayante', 'ayantes', 'ayants', 'eu', 'eue', 'eues', 'eus', 'ai', 'as', 'avons', 'avez', 'ont', 'aurai', 'auras', 'aura', 'aurons', 'aurez', 'auront', 'aurais', 'aurait', 'aurions', 'auriez', 'auraient', 'avais', 'avait', 'avions', 'aviez', 'avaient', 'eut', 'eûmes', 'eûtes', 'eurent', 'aie', 'aies', 'ait', 'ayons', 'ayez', 'aient', 'eusse', 'eusses', 'eût', 'eussions', 'eussiez', 'eussent']

In [None]:
# Import the predefined stopwords

print('\nEnglish Stop Words:')
print( english_stopwords )

print('\nGerman Stop Words:')
print( german_stopwords )

print('\nFrench Stop Words:')
print( french_stopwords )

In [None]:
# 4. Remove all elements in the list that are English stopwords
result = ...
print(result)

Great! Note that we rely on several libraries that ship with Anaconda, and that these text-processing techniques can easily be applied to any text you have. Keep these methods close at hand so that you can copy and paste the functions you need.

Now let's make a function that will do all those steps and then apply it to every row:

In [None]:
import re

def process_text(raw_text):
    # 1. Remove certain special symbols and replace by a space.
    text = re.sub(r"[.,\[\]{}|`~'\"“”’*&^%$@!?+>\-_]", " ", raw_text)

    # 2. Lowercase
    ...
    
    # 3. Split by words
    ...
    
    # 4. Remove English stop words
    
    # Sets are much faster than lists in Python for membership comparisons
    stops = set(english_stopwords)
    
    ...
    
    # 5. Return the words concatenated with each other
    ...

In [None]:
# Let's try the method for multiple inputs:


In [None]:
# Now we can apply it to every row and store it in a new column
posts['processed_message'] = ...

In [None]:
# And Tadaaaaa


Now, let's loop over all rows, take each word of the `processed_message` column and create a new dictionary where each word is a key and the value is the sum of the number of reactions.
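The accumulation described here can be sketched on made-up rows as shown below. This sketch swaps in a plain `for` loop with `zip` and a `defaultdict` rather than `apply`, purely to keep the example short; `words_demo`, `word_reaction_demo` and `ranking_demo` are hypothetical names.

```python
from collections import defaultdict

import pandas as pd

# Made-up processed messages with their total number of reactions.
words_demo = pd.DataFrame({
    'processed_message': ['dog park', 'monday again', 'dog birthday', ''],
    'n_reactions': [10, 2, 7, 5],
})

# Every new word starts at 0, so there is no need to check whether it is known.
word_reaction_demo = defaultdict(int)

for message, n_reactions in zip(words_demo['processed_message'], words_demo['n_reactions']):
    # An empty message splits into an empty list and contributes nothing.
    for word in message.split():
        word_reaction_demo[word] += int(n_reactions)

# Convert to a Series and sort from big to small.
ranking_demo = pd.Series(dict(word_reaction_demo)).sort_values(ascending=False)
print(ranking_demo)
```

A word that occurs twice in the same message receives that post's reactions twice in this sketch; whether that is desirable depends on the question you want to answer.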

In [None]:
# Define empty dictionary


# Define function that will process one processed_message
def handle_processed_message(message, n_reactions):
    # Split by space
    words = ...
    
    # Loop over words and add n_reactions to the dictionary
    ...
        # If the word is already known, add
        ...
            
print('Starting...')
# Call the method for each row:
posts.apply(lambda row : ..., axis=1)

print('Done')

In [None]:
# See the dictionary
word_reaction

In [None]:
# Convert to Series
word_reaction = ...

# Sort from Big to Small
...

In [None]:
# Take a look at the 10 best


In [None]:
# Take a look at the 10 worst


In [None]:
# Plot the 100 best words

# Let's plot the data as a bar chart.
# Make sure that the plot is big enough and DO NOT plot everything, just select a subset, because otherwise it
# will be unreadable


# Set the label on the x-axis


# Set the label on the y-axis


# Set the title and font size


# Show the grid


In [None]:
# Plot the 100 worst words

# Let's plot the data as a bar chart.
# Make sure that the plot is big enough and DO NOT plot everything, just select a subset, because otherwise it
# will be unreadable


# Set the label on the x-axis


# Set the label on the y-axis


# Set the title and font size


# Show the grid
