# TATR: pandas and CSV of Tweets

This notebook is part of a larger series of Jupyter notebooks structured around the analysis of Twitter tweets. This particular notebook looks at loading a CSV (comma-separated values) file into a pandas data structure. As one of the introductory coding notebooks in the series, it aims to teach and showcase how to save tweets into CSV files and load them back.

Any additional assumptions and clarifications will be stated as they come up in the notebook.

Written 2018.

## Introduction: CSV

CSV is shorthand for comma-separated values. It is a common plain-text file format that can be imported into software such as Microsoft Excel. Although the name implies that each value is separated by a comma, that is not always the case: values can also be separated by tabs, spaces, or other characters. It is therefore important to know how your CSV, or someone else's, is structured.

For example, both of the following are valid forms for a line in a CSV file:
   * Jan-01-18, green, 10, good, comments
   * Jan-01-18 green 10 good comments

Typically each entry in a CSV file takes up one line, so 100 data entries produce 100 lines in the file (plus one more if the file starts with a header row of column names).
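As a minimal sketch of why the separator matters, the two example lines above can be parsed with `pd.read_csv` by telling it which character separates the values. The column names below are hypothetical and exist only for this illustration; `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# The two example lines from above: one comma-separated, one space-separated.
comma_text = "Jan-01-18, green, 10, good, comments\n"
space_text = "Jan-01-18 green 10 good comments\n"

# Hypothetical column names, chosen only for this illustration.
columns = ["date", "colour", "count", "rating", "note"]

# sep tells pandas which character separates the values.
# skipinitialspace drops the blank that follows each comma.
comma_frame = pd.read_csv(io.StringIO(comma_text), sep=",", header=None,
                          names=columns, skipinitialspace=True)
space_frame = pd.read_csv(io.StringIO(space_text), sep=" ", header=None,
                          names=columns)

print(comma_frame)
print(space_frame)
```

Both calls should give the same one-row dataframe. Reading a file with the wrong separator usually still "works", but everything ends up in a single column, which is why it pays to check the structure first.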

To find out more about CSV:
* https://en.wikipedia.org/wiki/Comma-separated_values

## Import Libraries

Now we will import all the Python 3 libraries used in this notebook. You do not need to know everything each library can do, as some of them are massive. Any functionality that is used will be explained as it appears, so do not worry if you do not recognize the libraries.

To install the required libraries, see the Jupyter documentation or each library's home page for instructions.

### Note: 
All of the libraries used here are available for Anaconda.

In [28]:
# Importing data structure libraries
import pandas as pd
import numpy as np

## Introduction: pandas DataFrame

We will begin with an introduction to the pandas DataFrame. This is the primary data structure used in the TATR notebook series for handling Twitter data, because it offers a wide variety of data-manipulation functionality and is backed by a large amount of documentation.

If there is any pandas question or feature that you are unsure of or want to know more about, see the following:
* https://pandas.pydata.org/pandas-docs/stable/tutorials.html

A pandas DataFrame is a two-dimensional data structure. It lets us label each row and column, and it can be loaded directly from a CSV. To begin, we are going to create some "dummy Twitter data" so that everyone looking at this notebook can follow along with the same dataframe (apart from the random retweet counts). Feel free to skip this part if you are already familiar with pandas or already have data to use.

### Dummy pandas DataFrame Structure

We are going to create 4 entries and each entry will have 3 values (Date, Retweet count, text).

To do so we will use 2 different functions to help create the Date and Retweet count:
* pd.Timestamp(some date format): turns the input into a date (a timestamp)
* np.random.randint(range, size=how many): creates the declared number of random integers from 0 up to, but not including, range
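The two helper functions can be tried on their own first. This is a small sketch rather than part of the original notebook; the seed value `0` is an arbitrary assumption, added only to show how the random counts can be made repeatable between runs.

```python
import numpy as np
import pandas as pd

# Turn a date string into a pandas Timestamp.
date = pd.Timestamp("20180101")
print(date.year, date.month, date.day)

# Four random integers, each at least 0 and strictly less than 5.
# Seeding the generator first (the seed 0 is arbitrary) makes the draw repeatable.
np.random.seed(0)
first = np.random.randint(5, size=4)

np.random.seed(0)
second = np.random.randint(5, size=4)

print(first)
print((first == second).all())  # same seed, same numbers
```

Without a seed, the retweet counts change every time the cell is run, so your values will not match the ones printed in this notebook.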


In [29]:
# Creating the pandas dataframe
pandaDataFrame = pd.DataFrame({ 'Date' : pd.Timestamp('20180101'),
                                'Retweet_Count' : np.random.randint(5, size=4),
                                'Text' :["Hey all!","#Fake Data","Not real","text"]
                              })

# Let's see what the dataframe looks like
pandaDataFrame

|   | Date | Retweet_Count | Text |
|---|------|---------------|------|
| 0 | 2018-01-01 | 0 | Hey all! |
| 1 | 2018-01-01 | 0 | #Fake Data |
| 2 | 2018-01-01 | 3 | Not real |
| 3 | 2018-01-01 | 4 | text |


As you can see, this created a pandas data structure with 4 entries numbered from 0 to 3 (these numbers are referred to as the index). Each value in an entry also belongs to a column (Date, Retweet_Count, or Text, as declared when we created the dataframe). If you want to rearrange the order of the columns, just swap their places when creating the dataframe above.
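The index and column labels described above can also be inspected directly, and the columns of an existing dataframe can be reordered without rebuilding it. The following is a minimal sketch; the fixed retweet counts are stand-ins for the random ones so that the example is repeatable.

```python
import pandas as pd

# Same shape as the dummy data above, with fixed stand-in retweet counts.
frame = pd.DataFrame({
    "Date": pd.Timestamp("20180101"),
    "Retweet_Count": [0, 0, 3, 4],
    "Text": ["Hey all!", "#Fake Data", "Not real", "text"],
})

# The row labels (index) and the column labels.
print(list(frame.index))
print(list(frame.columns))

# Selecting the columns in a new order returns a reordered dataframe.
reordered = frame[["Text", "Retweet_Count", "Date"]]
print(list(reordered.columns))
```

Selecting with a list of column names leaves the original dataframe untouched, so `frame` keeps its original column order.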

## Saving a pandas DataFrame into a CSV

Now that we have a dataframe to use, we will save its contents into a CSV. We are going to create a function that takes our dataframe and writes it out as a CSV file, so that you can copy and paste the function out of this notebook and use it in another project.

### Note:
We will not be saving the index of the pandas dataframe, because it has no value for this example. However, the function notes how to save it.

In [30]:
def save_frame_to_CSV(dataframe, name_of_file):
    '''
    Save the dataframe into a CSV file

    :dataframe:    The dataframe that is being saved
    :name_of_file: The name of the CSV file you wish to save it as (without the ".csv" extension)
    '''
    print("Begin saving dataframe into a csv...\n")

    # Attach the CSV file extension to the name
    name = name_of_file + ".csv"

    # Convert the dataframe into a CSV that is separated by commas
    # Remove "index=False" if you want to save the index
    dataframe.to_csv(name, sep=',', encoding='utf-8', index=False)

    print("Finish and saved into " + name + "\n")

Now that we have a function that saves a dataframe into a CSV, we can save the contents of pandaDataFrame. This is done by calling the function and filling in the parameters.

In [31]:
# Save the dataframe into a CSV
save_frame_to_CSV(pandaDataFrame, "DummyTestCSV")

Begin saving dataframe into a csv...

Finish and saved into DummyTestCSV.csv



## Looking at the CSV
If you open the CSV with any text editor, you will see something like this (the Retweet_Count values are random, so yours will differ):
    
Date,Retweet_Count,Text  
2018-01-01,1,Hey all!  
2018-01-01,1,#Fake Data  
2018-01-01,2,Not real  
2018-01-01,2,text  

As you can see, the column labels of our pandas dataframe were saved as the first row. Because we passed index=False, the index was not written. Had we saved the index, every row would gain an extra first column, and the header row would start with a "," because the index column has no label of its own.
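To see how saving the index changes the header row, here is a minimal sketch that compares the two cases. It relies on `DataFrame.to_csv` returning the CSV text as a string when no file name is given; the fixed retweet counts are stand-ins for the random ones.

```python
import pandas as pd

# Same shape as the dummy data, with fixed stand-in retweet counts.
frame = pd.DataFrame({
    "Date": pd.Timestamp("20180101"),
    "Retweet_Count": [0, 0, 3, 4],
    "Text": ["Hey all!", "#Fake Data", "Not real", "text"],
})

# With no file name, to_csv returns the CSV text instead of writing a file.
without_index = frame.to_csv(index=False)
with_index = frame.to_csv()  # index=True is the default

# Compare the first (header) line of each version.
header_without = without_index.splitlines()[0]
header_with = with_index.splitlines()[0]

print(header_without)
print(header_with)
```

The header written with the index begins with a comma: the unnamed index column comes first, and its label is empty.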

## Loading the CSV 
Similar to before, we are going to create a new function, this time one that loads the CSV. It does not account for the index having been saved earlier.

### Note:
If you want to know more about how pandas handles CSV files, see the following:
* https://chrisalbon.com/python/data_wrangling/pandas_dataframe_importing_csv/

In [32]:
def read_CSV_file(csv_file_name):
    '''
    Reads a CSV and turns it into a pandas dataframe

    :csv_file_name: The CSV file name (without the ".csv" extension)
    '''
    print("Reading csv " + csv_file_name + "...\n")

    # Attach the CSV file extension
    name = csv_file_name + ".csv"

    # Load the comma-separated file; pandas creates a fresh index
    return_frame = pd.read_csv(name)

    print("Finish reading " + csv_file_name + "\n")
    return return_frame

Now that we have created a function to load a CSV, we can load the saved CSV back in.

In [33]:
# Read in the CSV
loadedDataframe = read_CSV_file("DummyTestCSV")

# See what the dataframe looks like
loadedDataframe

Reading csv DummyTestCSV...

Finish reading DummyTestCSV



|   | Date | Retweet_Count | Text |
|---|------|---------------|------|
| 0 | 2018-01-01 | 0 | Hey all! |
| 1 | 2018-01-01 | 0 | #Fake Data |
| 2 | 2018-01-01 | 3 | Not real |
| 3 | 2018-01-01 | 4 | text |


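As you can see, pandas automatically created a new index for you. If you saved the index when creating the CSV, additional steps are required: the saved index is read back as an ordinary column (labelled "Unnamed: 0") unless you tell pd.read_csv which column holds the index, for example with index_col=0.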
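For a CSV that was saved with its index, one possible way to handle the additional steps is sketched below. This is an illustration under stated assumptions, not part of the original functions: it uses `io.StringIO` in place of a file on disk, fixed stand-in retweet counts, and the `index_col` and `parse_dates` parameters of `pd.read_csv`.

```python
import io

import pandas as pd

# Same shape as the dummy data, with fixed stand-in retweet counts.
frame = pd.DataFrame({
    "Date": pd.Timestamp("20180101"),
    "Retweet_Count": [0, 0, 3, 4],
    "Text": ["Hey all!", "#Fake Data", "Not real", "text"],
})

# Save WITH the index (the default), keeping the CSV text in memory.
csv_text = frame.to_csv()

# A plain read treats the saved index as an ordinary, unnamed column.
plain = pd.read_csv(io.StringIO(csv_text))
print(list(plain.columns))

# index_col=0 uses the first column as the index again;
# parse_dates converts the Date column from text back into dates.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["Date"])
print(list(restored.columns))
print(restored.dtypes)
```

Note that a plain `pd.read_csv` loads the Date column as text, since a CSV stores no type information. `parse_dates` is one way to get dates back if later analysis needs them.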

## Conclusion

In this notebook we went over what a CSV is. Along with a brief overview of what pandas does, we looked at how it can be used to save data in CSV format and to load it back from a file. Although this notebook contains no code or concepts that are exclusive to tweets or Twitter, it covers essential features that will be used in other TATR notebooks.

This notebook therefore serves as one of the more introductory notebooks in the TATR notebook series.