
<figure>
   <IMG SRC="https://mamba-python.nl/images/logo_basis.png" WIDTH=125 ALIGN="right">
</figure>



# Exercises Whatsapp


This notebook contains exercises with pandas, a very commonly used Python package. We use it to analyze WhatsApp data, obtained by exporting a single WhatsApp chat. See https://faq.whatsapp.com/en/android/23756533/ to export your own data.
You can do this for this exercise if you are interested! If not, there is also an anonymous WhatsApp chat available.
 
<div style="text-align: right"> developed by MAMBA </div>
 This notebook is part of the Mamba python course. 





Table of contents:<a class="anchor" id="0"></a>
1. [import files](#1)
2. [read whatsapp data](#2)
3. [anonymize data](#3)
4. [Exercises](#4)
5. [Answers](#5)

## 1. import files<a class="anchor" id="1"></a>

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import whatsapp_func as wf

In [2]:
#settings
%matplotlib inline
plt.style.use('seaborn')  # renamed to 'seaborn-v0_8' in matplotlib 3.6; use that name on newer versions

## 2. read whatsapp data <a class="anchor" id="2"></a>
Below is the code to read a .zip file with a WhatsApp chat history. For privacy reasons I did not include my own zip file with my chat history, but you can use your own! If you prefer not to, skip this step and read the anonymised pandas dataframe file in chapter 3. If you <i>are</i> using your own data, you can skip chapter 3 instead.

In [44]:
# read whatsapp data (save the zip in the data folder and fill out the full name in the code below)
whatsapp_zip = r'data\WhatsApp Chat - xxx xxx.zip'

time_user_df = wf.read_whatsapp(whatsapp_zip)

### 2.1 Pandas dataframe

We save the data in a pandas dataframe (`time_user_df`). A dataframe is an object from the pandas package, one of the most used packages for scientific applications in Python and a standard tool for data analysis. Within pandas, the dataframe is the most used object. It gives you a way to organise and work with your data. You can think of it as a spreadsheet, a SQL table or an indexed dictionary. A dataframe has rows and columns, both of which can optionally be given names.

In the exercises below we will work with the pandas dataframe intuitively. For a more in-depth tutorial, continue after this notebook with the exercise notebook found in on_topic/1_pandas.
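
As a small illustration of the description above, the sketch below builds a toy dataframe by hand. The column names mirror the ones used later in this notebook, but the rows are made up for the example and `toy_df` is a hypothetical name that does not appear elsewhere in the course material.

```python
import pandas as pd

# made-up rows, only for illustration (not the real chat data)
toy_df = pd.DataFrame(
    {
        "user": ["user1", "user2", "user1"],
        "text": ["hello", "hi there", "bye"],
    },
    index=pd.to_datetime(
        ["2018-03-22 12:18:07", "2018-03-22 12:18:19", "2018-03-22 14:39:57"]
    ),
)

print(toy_df.columns.tolist())  # the column names
print(toy_df.index)             # the row labels, here timestamps

# look up a single value by its row label and column name
one_text = toy_df.loc[pd.Timestamp("2018-03-22 12:18:19"), "text"]
print(one_text)
```

The row labels (the index) do not have to be numbers: here they are timestamps, just like in the chat dataframe.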

## 3. anonymize data <a class="anchor" id="3"></a>
This chapter is only necessary if you want to save your data as anonymous data or if you want to use the anonymized dataset (`_chat_df.csv`) for analysis. Uncomment the code lines that you want to use.

In [36]:
## save anonymized dataframe

# time_user_df[['user','message']].to_csv(r'data\_chat_df.csv')

In [37]:
## read anonymized dataframe

# time_user_df = pd.read_csv(r'data\_chat_df.csv', index_col=0, parse_dates=True)
# time_user_df.head()
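
To see what `index_col=0` and `parse_dates=True` do when reading the file back, here is a minimal sketch of a save-and-read round trip. It assumes a made-up two-row dataframe and writes to an in-memory buffer instead of a file in the data folder. The names `toy_df`, `buffer` and `restored_df` are hypothetical.

```python
import io

import pandas as pd

# made-up data with a timestamp index, only for illustration
toy_df = pd.DataFrame(
    {"user": ["user1", "user2"], "message": [1, 1]},
    index=pd.to_datetime(["2018-03-22 12:18:07", "2018-03-22 14:39:57"]),
)

# write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy_df.to_csv(buffer)
buffer.seek(0)

# index_col=0 uses the first column as the index,
# parse_dates=True converts that index to timestamps
restored_df = pd.read_csv(buffer, index_col=0, parse_dates=True)
print(type(restored_df.index))
print(restored_df)
```

Without `parse_dates=True` the index would be read back as plain strings rather than timestamps.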

## 4. Exercises<a class="anchor" id="4"></a>

Note: Below are exercises to get you started with data analysis. If, at any time, you come up with your own idea for an analysis or visualisation, that's awesome: please try to make it work and show us. This is the best way to learn how to do data analysis in Python. Use Google, Stack Overflow and the pandas documentation to find out how to do things.

You can print the first 5 lines of your dataframe using the `head()` method.

In [45]:
time_user_df.head()

| | user | text | message |
|---|---|---|---|
| 2018-03-22 12:18:07 | user1 | Nee voor de gemeenteraad is alleen per lijst ... | 1 |
| 2018-03-22 12:18:19 | user1 | Ze gaan nu in Ahoy per kandidaat tellen | 1 |
| 2018-03-22 14:39:57 | user2 | ah oké | 1 |
| 2018-03-22 14:40:03 | user2 | enig idee tot hoe laat dat duurt? | 1 |
| 2018-03-22 14:45:13 | user1 | Nee | 1 |


#### Exercise 1
Have a look at the dataframe `time_user_df` you've obtained. How many columns does it have? What is the index?

#### Exercise 2
How many messages are in your exported chat history?

#### Exercise 3
Find the unique users in your chat history

#### Exercise 4
You can add columns to a DataFrame. To add a column with ones you can use the code below.

In [46]:
time_user_df['column_with_ones'] = 1
time_user_df.head()

| | user | text | message | column_with_ones |
|---|---|---|---|---|
| 2018-03-22 12:18:07 | user1 | Nee voor de gemeenteraad is alleen per lijst ... | 1 | 1 |
| 2018-03-22 12:18:19 | user1 | Ze gaan nu in Ahoy per kandidaat tellen | 1 | 1 |
| 2018-03-22 14:39:57 | user2 | ah oké | 1 | 1 |
| 2018-03-22 14:40:03 | user2 | enig idee tot hoe laat dat duurt? | 1 | 1 |
| 2018-03-22 14:45:13 | user1 | Nee | 1 | 1 |


You can also create a new column using data from another column. Create a new column containing a string that consists of the user name and the text 'whatsapp'. For example, if the user name of a row is 'user1', your new column should have the value 'user1 whatsapp'.

#### Exercise 5
Create an extra column with the number of characters in your text column.

#### Exercise 6

You can use the method `groupby()` to group your data by the items in a certain column. You can group the data per user with the code below. With this you obtain a GroupBy object. To get the relevant data per user you have to specify how to handle the data in the other columns. To get the sum of the data in the other columns, call `sum()` on the GroupBy object:

In [49]:
df_gb = time_user_df.groupby('user')
df_gb.sum(numeric_only=True)  # numeric_only=True skips the text columns, matching the output below

| user | message | column_with_ones |
|---|---|---|
| user1 | 914 | 914 |
| user2 | 1279 | 1279 |



Now get the average length of a text message per user. Note: you need the answer to exercise 5 for this question.
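
If the GroupBy object feels abstract, a toy example may help. The sketch below uses a made-up dataframe with a hypothetical `score` column (not part of the chat data) and shows that different aggregation methods can be applied to the same grouping.

```python
import pandas as pd

# made-up data, only for illustration
toy_df = pd.DataFrame(
    {
        "user": ["user1", "user2", "user1", "user2"],
        "score": [2, 4, 6, 8],
    }
)

toy_gb = toy_df.groupby("user")

score_sum = toy_gb["score"].sum()  # total per user
score_max = toy_gb["score"].max()  # largest value per user
rows_per_user = toy_gb.size()      # number of rows per user

print(score_sum)
print(score_max)
print(rows_per_user)
```

Each result is indexed by the values of the column you grouped on, so there is one row per user.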

#### Exercise 7
Use the examples from the [example notebook](whatsapp_data_analysis.ipynb) to plot your own graphs in the analysis. Try to follow line by line what is done.

## 5. Answers<a class="anchor" id="5"></a>

#### Answer Exercise 1

In [48]:
# columns in dataframe
print(time_user_df.columns)

# index
print(time_user_df.index)

Index(['user', 'text', 'message', 'column_with_ones', 'username_and_whatsapp'], dtype='object')
DatetimeIndex(['2018-03-22 12:18:07', '2018-03-22 12:18:19',
               '2018-03-22 14:39:57', '2018-03-22 14:40:03',
               '2018-03-22 14:45:13', '2018-03-22 17:29:51',
               '2018-03-22 17:29:59', '2018-03-22 18:22:55',
               '2018-03-22 19:11:21', '2018-03-22 20:21:13',
               ...
               '2018-09-25 18:58:57', '2018-09-25 19:00:25',
               '2018-09-25 19:00:34', '2018-09-25 22:48:07',
               '2018-09-25 22:53:46', '2018-09-25 22:53:52',
               '2018-09-25 23:06:15', '2018-09-25 23:06:58',
               '2018-09-26 11:25:12', '2018-09-26 11:29:02'],
              dtype='datetime64[ns]', length=2193, freq=None)
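
The number of columns can also be read directly from the `shape` attribute. A minimal sketch on a made-up dataframe (the name `toy_df` and its rows are hypothetical):

```python
import pandas as pd

# made-up data, only for illustration
toy_df = pd.DataFrame(
    {"user": ["user1", "user2"], "text": ["hello", "hi"], "message": [1, 1]}
)

n_rows, n_columns = toy_df.shape  # shape is a (rows, columns) tuple
print(n_columns)
print(len(toy_df.columns))        # same number, counted from the column labels
```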


#### Answer Exercise 2

In [14]:
# total number of messages
# option 1
print(time_user_df.shape[0])

# option 2
print(time_user_df.message.sum())

# option 3
print(len(time_user_df))

2193
2193
2193


#### Answer Exercise 3

In [16]:
#unique users
print(time_user_df.user.unique())

['user1' 'user2']
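
Two related methods may be useful here: `nunique()` counts the distinct values and `value_counts()` counts how often each value occurs. A small sketch on a made-up series (the name `users` and its values are hypothetical):

```python
import pandas as pd

# made-up user column, only for illustration
users = pd.Series(["user1", "user2", "user1", "user1"], name="user")

n_users = users.nunique()             # number of distinct users
rows_per_user = users.value_counts()  # number of rows per distinct user

print(n_users)
print(rows_per_user)
```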


#### Answer Exercise 4

In [47]:
time_user_df['username_and_whatsapp'] = time_user_df['user'] + ' whatsapp'
time_user_df.head()

| | user | text | message | column_with_ones | username_and_whatsapp |
|---|---|---|---|---|---|
| 2018-03-22 12:18:07 | user1 | Nee voor de gemeenteraad is alleen per lijst ... | 1 | 1 | user1 whatsapp |
| 2018-03-22 12:18:19 | user1 | Ze gaan nu in Ahoy per kandidaat tellen | 1 | 1 | user1 whatsapp |
| 2018-03-22 14:39:57 | user2 | ah oké | 1 | 1 | user2 whatsapp |
| 2018-03-22 14:40:03 | user2 | enig idee tot hoe laat dat duurt? | 1 | 1 | user2 whatsapp |
| 2018-03-22 14:45:13 | user1 | Nee | 1 | 1 | user1 whatsapp |


#### Answer Exercise 5

In [51]:
time_user_df['text_length'] = time_user_df['text'].str.len()
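
To see what `.str.len()` returns, here is a minimal sketch on a made-up series. The values are hypothetical, and the missing entry is included on purpose to show how it is handled.

```python
import pandas as pd

# made-up text column with one missing value, only for illustration
texts = pd.Series(["hello", "Nee", None], name="text")

lengths = texts.str.len()  # number of characters per entry, missing stays missing
print(lengths)
```

The `.str` accessor applies the string operation to every element, so no explicit loop is needed.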

#### Answer Exercise 6

In [52]:
# average values per user
df_gb = time_user_df.groupby('user')
df_gb.mean(numeric_only=True)  # numeric_only=True skips the text columns, which have no mean

| user | message | column_with_ones | text_length |
|---|---|---|---|
| user1 | 1.0 | 1.0 | 36.702407 |
| user2 | 1.0 | 1.0 | 38.825645 |
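
If you only need the average text length, you can select that one column from the GroupBy object before aggregating. A minimal sketch on made-up data (the name `toy_df` and its rows are hypothetical):

```python
import pandas as pd

# made-up data, only for illustration
toy_df = pd.DataFrame(
    {
        "user": ["user1", "user2", "user1"],
        "text": ["hello", "hi", "bye"],
    }
)
toy_df["text_length"] = toy_df["text"].str.len()

# select one column first, then take the mean per user
mean_length = toy_df.groupby("user")["text_length"].mean()
print(mean_length)
```

This returns a Series with one value per user instead of a dataframe with a column for every numeric field.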


#### Answer Exercise 7