# TATR: Panda Dataframe Manipulation

This notebook is part of a larger series of Jupyter Notebooks structured around the analysis of Twitter tweets. This particular notebook looks at pandas dataframe manipulation, focusing on the index, groupby, and count features of pandas. It aims to teach these concepts and how they interact with a csv file.


Any additional assumptions and clarifications will be stated as they come up in the notebook.

Written 2018.

### Note:

This notebook builds upon ideas and concepts in TATR: Panda and CSV of Tweets.

## Import Libraries

Now we will import all the Python 3 libraries used in this notebook. You do not need to know everything each library can do, as some are massive. Any functionality that is used will be explained as it appears, so do not worry if you do not recognize the libraries.

To install the required libraries, see the Jupyter documentation or each library's home page for instructions.

### Note: 
All libraries used in this notebook are available for Anaconda.

In [1]:
# Importing data structure libraries
import pandas as pd
import numpy as np

## Setting up some dummy data

Similar to before, this notebook creates a small set of dummy data for demonstration, so that everyone using this notebook can test its functionality. However, we will be making some changes to the data created in the other notebook.

### Dummy Panda Dataframe Structure

We are going to create 8 entries. We will split the entries into two halves (this will be explained later) and each entry will have 3 values (Date, hashtag_count, mention_count). To find out how to generate these values see TATR: Tokenization and Extraction. 

To do so we will use 2 different functions to help create the Date and the two count columns:
* pd.Timestamp(some date format) : turns the input into a date (a pandas Timestamp)
* np.random.randint(range, size=how many) : creates the declared number of random integers from 0 up to, but not including, range
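
The two functions listed above can be tried on their own before they are used to build the dataframes. This is a minimal sketch; the variable names `day` and `draws` are illustrative and are not used elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# pd.Timestamp parses a date string into a single timestamp object.
day = pd.Timestamp('20180101')
print(day)  # 2018-01-01 00:00:00

# np.random.randint(3, size=4) draws 4 integers, each one 0, 1, or 2.
# The upper bound (3) is never returned, and the values differ on every run.
draws = np.random.randint(3, size=4)
print(draws)
```

Because the draws are random, the tables shown in this notebook will not match your own output exactly.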


In [2]:
# Creating the first dataframe
FirstDataframe = pd.DataFrame({ 'Date' : pd.Timestamp('20180101'),
                                'hashtag_count' : np.random.randint(3, size=4),
                                'mention_count' : np.random.randint(10, size=4)
                              })

# Creating the second dataframe
SecondDataframe = pd.DataFrame({ 'Date' : pd.Timestamp('20180102'),
                                'hashtag_count' : np.random.randint(5, size=4),
                                'mention_count' : np.random.randint(8, size=4)
                              })

Now let's see what the two dataframes look like.

In [3]:
# Let's see the first dataframe
FirstDataframe

|   | Date | hashtag_count | mention_count |
|---|------|---------------|---------------|
| 0 | 2018-01-01 | 1 | 0 |
| 1 | 2018-01-01 | 1 | 2 |
| 2 | 2018-01-01 | 1 | 2 |
| 3 | 2018-01-01 | 0 | 6 |


In [4]:
# Let's see the second dataframe
SecondDataframe

|   | Date | hashtag_count | mention_count |
|---|------|---------------|---------------|
| 0 | 2018-01-02 | 2 | 5 |
| 1 | 2018-01-02 | 4 | 1 |
| 2 | 2018-01-02 | 0 | 7 |
| 3 | 2018-01-02 | 2 | 4 |


Before we continue we are going to fuse these two dataframes together, because we want to be able to manipulate all the data at once. We initially created two separate pandas dataframes because it was easier to create them that way, and because it lets us showcase how to combine dataframes. To do so, we will use the pandas function "concat".

In [5]:
# Combine the two dataframes together
CombinedDataFrame = pd.concat([FirstDataframe, SecondDataframe])

# See the new dataframe
CombinedDataFrame

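|   | Date | hashtag_count | mention_count |
|---|------|---------------|---------------|
| 0 | 2018-01-01 | 1 | 0 |
| 1 | 2018-01-01 | 1 | 2 |
| 2 | 2018-01-01 | 1 | 2 |
| 3 | 2018-01-01 | 0 | 6 |
| 0 | 2018-01-02 | 2 | 5 |
| 1 | 2018-01-02 | 4 | 1 |
| 2 | 2018-01-02 | 0 | 7 |
| 3 | 2018-01-02 | 2 | 4 |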


As you can see, the "index" was not renumbered: concat kept the original row labels of each dataframe, so the labels 0 to 3 appear twice. This "index" doesn't serve a purpose here, so we are going to replace it with the "Date" column. To do so we are going to use "set_index".

To find out more about set_index see:
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.set_index.html
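
The index behaviour described above can be checked on a smaller example. This is a minimal sketch that assumes two tiny hand-made frames; the names `top`, `bottom`, `combined`, and `by_date` are illustrative only.

```python
import pandas as pd

# Two tiny frames with two rows each (illustrative data, not the notebook's).
top = pd.DataFrame({'Date': pd.Timestamp('20180101'), 'hashtag_count': [1, 0]})
bottom = pd.DataFrame({'Date': pd.Timestamp('20180102'), 'hashtag_count': [2, 4]})

combined = pd.concat([top, bottom])

# concat keeps each frame's own row labels, so 0 and 1 both appear twice.
print(list(combined.index))  # [0, 1, 0, 1]

# set_index returns a new dataframe whose index is the Date column.
by_date = combined.set_index('Date')
print(by_date.index.name)          # Date

# The original dataframe is not modified: it still has Date as a column.
print('Date' in combined.columns)  # True
```

To keep the re-indexed result, assign the return value to a variable, as done with `by_date` here.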


In [6]:
# Set the index to the Date
# Note: set_index returns a new dataframe; CombinedDataFrame itself is left unchanged
CombinedDataFrame.set_index('Date')

| Date | hashtag_count | mention_count |
|------|---------------|---------------|
| 2018-01-01 | 1 | 0 |
| 2018-01-01 | 1 | 2 |
| 2018-01-01 | 1 | 2 |
| 2018-01-01 | 0 | 6 |
| 2018-01-02 | 2 | 5 |
| 2018-01-02 | 4 | 1 |
| 2018-01-02 | 0 | 7 |
| 2018-01-02 | 2 | 4 |


As you can see, the returned dataframe uses the date as its index. Note that set_index returns a new dataframe and leaves CombinedDataFrame unchanged, so the "Date" column is still available for the next step. Now we are going to combine all the values for each date. This is done with a combination of "groupby" and "sum": groupby collects the entries that share the same date into one group, and sum adds up the values of each column within each group. We are going to group the data by date and sum all the values of each column. We will also create another dataframe that counts the number of entries for each date.

To find out more about groupby:
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html

To find out more about count:
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.count.html

To find out more about sum:
* https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html
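
To make the grouping step easier to check by hand, the sketch below uses fixed values copied from the combined dataframe output shown earlier instead of random ones. The names `frame` and `totals` are illustrative only.

```python
import pandas as pd

# Fixed values taken from the combined dataframe output shown earlier.
frame = pd.DataFrame({
    'Date': [pd.Timestamp('20180101')] * 4 + [pd.Timestamp('20180102')] * 4,
    'hashtag_count': [1, 1, 1, 0, 2, 4, 0, 2],
    'mention_count': [0, 2, 2, 6, 5, 1, 7, 4],
})

# groupby('Date') collects the rows that share a date into one group;
# sum() then adds up each column within each group.
totals = frame.groupby('Date').sum()
print(totals)
# 2018-01-01: hashtag_count 1+1+1+0 = 3, mention_count 0+2+2+6 = 10
# 2018-01-02: hashtag_count 2+4+0+2 = 8, mention_count 5+1+7+4 = 17
```

The result has one row per date, and the grouping column becomes the index of the new dataframe.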

In [7]:
# Group the data by "Date" and sum all the values
SumDataFrame = CombinedDataFrame.groupby("Date").sum()

# See what it looks like
SumDataFrame

| Date | hashtag_count | mention_count |
|------|---------------|---------------|
| 2018-01-01 | 3 | 10 |
| 2018-01-02 | 8 | 17 |


Now we are going to count the entries for each date in a separate dataframe.

In [8]:
# Group the data by "Date" and count all the values
CountDataFrame = CombinedDataFrame.groupby('Date').count()

# See what it looks like
CountDataFrame

| Date | hashtag_count | mention_count |
|------|---------------|---------------|
| 2018-01-01 | 4 | 4 |
| 2018-01-02 | 4 | 4 |


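As you can see, it counted the entries in each column for each date. Note that count only counts non-missing values, which is why it reports a separate number for each column; here every column has 4 entries per date. This may be useful to keep track of depending on your needs.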
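
The per-column behaviour of count only becomes visible when a value is missing. The following is a minimal sketch, assuming an illustrative frame with one missing mention_count; it also uses the groupby method size, which this notebook does not otherwise cover, purely for comparison.

```python
import numpy as np
import pandas as pd

# Illustrative frame: the second row has no mention_count value.
frame = pd.DataFrame({
    'Date': [pd.Timestamp('20180101')] * 2 + [pd.Timestamp('20180102')] * 2,
    'hashtag_count': [1, 0, 2, 4],
    'mention_count': [3, np.nan, 5, 1],
})

# count() reports the number of non-missing values in each column per date.
counts = frame.groupby('Date').count()
print(counts)

# size() reports the number of rows per date, whether or not values are missing.
sizes = frame.groupby('Date').size()
print(sizes)
```

For 2018-01-01 the mention_count column is counted once while the group itself has two rows, so the two results differ only where data is missing.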

## Conclusion

In this notebook we went over pandas dataframe manipulation. Although this notebook does not go over all the different features and functionalities that pandas offers, it does showcase some of the more common ones. These features will be used in a later notebook, so it is important to understand how they work.

This notebook serves as one of the more introductory notebooks in the TATR notebook series.