The goal of this project is to compare the sentiment score of tweets with their location and time. We therefore need to attach some *time* information to each tweet. That is the purpose of this code.

Each tweet carries time information in its **createdAt** column. This column contains strings that can easily be converted to a *datetime* object, from which we retrieve the **year**, **month**, and **day** of the tweet. We only do this for tweets that are valid (meaning their id is an integer) and that are also present in the *location dataset*: everything will be joined afterwards, so there is no point in processing tweets we already know to be useless.
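
As a minimal sketch of that conversion, the snippet below parses a few made-up **createdAt** strings and extracts the three fields. The sample rows and ids are hypothetical, and it uses `pd.to_datetime` with `errors="coerce"` rather than the `DatetimeIndex` calls of the script, so that the `0000-00-00 00:00:00` placeholder turns into a missing value instead of raising an error.

```python
import pandas as pd

# Hypothetical sample rows; the real values come from the tweet dump.
sample = pd.DataFrame(
    {"createdAt": ["2016-03-14 08:15:00",
                   "0000-00-00 00:00:00",
                   "2015-12-31 23:59:59"]},
    index=pd.Index([101, 102, 103], name="id"),
)

# Strings that are not valid timestamps become NaT instead of raising.
parsed = pd.to_datetime(
    sample["createdAt"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)

# Drop the tweets without a usable date.
has_date = parsed.notna()
sample = sample[has_date].copy()
parsed = parsed[has_date]

# Extract the fields we want to export.
sample["Year"] = parsed.dt.year
sample["Month"] = parsed.dt.month
sample["Day"] = parsed.dt.day
```

Only the two tweets with a real timestamp remain, each with its own `Year`, `Month` and `Day` columns.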

Below is the Python script we run to create one csv file per chunk of data. Once the script is done, we merge all these files. Each csv file created is structured as: **Tweet id, Year, Month, Day**.
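
The merge step is not part of the script, so here is one possible sketch of it. The function name `merge_date_chunks` and the output path are assumptions made for illustration; the only things taken from the script are the `data_<i>.csv` naming and the header-less four-column layout.

```python
import glob
import os

import pandas as pd


def merge_date_chunks(folder, output_path):
    """Concatenate every data_<i>.csv found in `folder` into one csv file."""
    columns = ["id", "Year", "Month", "Day"]
    paths = sorted(glob.glob(os.path.join(folder, "data_*.csv")))

    # Skip empty files: a chunk whose tweets all lacked a date produces
    # a zero-byte csv, which pd.read_csv cannot parse.
    paths = [p for p in paths if os.path.getsize(p) > 0]
    if not paths:
        raise ValueError("no non-empty chunk files found in " + folder)

    frames = [pd.read_csv(p, header=None, names=columns) for p in paths]
    merged = pd.concat(frames, ignore_index=True)

    # Keep the same header-less layout as the per-chunk files.
    merged.to_csv(output_path, header=False, index=False)
    return merged
```

With the folder used by the script, a call could look like `merge_date_chunks('data_dates', 'data_created/data_withDates.csv')`, where the output file name is hypothetical.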

In [None]:
import pandas as pd
import csv



#
## Merge the District and Canton information
#

# Load the districts information
df_districts = pd.read_table('data/ch_districts_capital.txt', sep=',')
df_districts.index = df_districts.id_ofs

# Load the cities data, containing the canton of each city
switzerland_cities = pd.read_csv("data/switzerland_cities.txt")
switzerland_cities = switzerland_cities.sort_values('Population', ascending=False)

# Load the schema file: its second column holds the column names of the tweet data
schema_rawfile = pd.read_csv("twitter-swisscom/schema_home.txt", header=None, sep=r'\s+')
data_columns = schema_rawfile[1].values



#
## Retrieve the date information for each tweet
#


# Variables used inside the for-loop loading the data
i = 0                                                                                                   # Index of the current chunk of data
usecolumns = ['id', 'createdAt']                                                                        # The only columns needed to retrieve the date information we want
data_loc = pd.read_csv("data_created/data_withLocation.csv", index_col=0, names=['District', 'Canton'])         # Load the id + District + Canton computed previously

# Load the data in chunks
for data in pd.read_table(open("twitter-swisscom/twex.tsv", 'r'), sep='\t', escapechar="\\", na_values='N', encoding='utf-8', quoting=csv.QUOTE_NONE, header=None, names=data_columns, engine='c', usecols=usecolumns, chunksize=10000 ):
    
    # Remove all entries with an invalid id. A valid id must be an integer.
    data = data[data.id.notnull()]
    # Vectorized check: keep only the ids whose string form is made of digits
    data = data[data.id.astype(str).str.isdigit()].copy()
    data.id = data.id.astype('int64')
    
    # If we already removed all the entries of the chunk, go to the next one
    if (len(data) == 0):
        i +=1
        continue
    

    # Keep only the tweets for which we have the location info: no need to compute the date of tweets without a location
    data = data[data.id.isin(data_loc.index)].copy()
    
    
    # If we still have some entries in this chunk of data, let's retrieve the date information we want.
    if (len(data) != 0):
        
        # Set id
        data.index = data.id
        
        # Remove the ones without the date info
        data['hasDate'] = data.createdAt != '0000-00-00 00:00:00'
        data = data[data.hasDate]
        
        # Convert the type
        data['date'] = pd.DatetimeIndex(data['createdAt']).normalize()
        
        # Retrieve the required date information
        data['Year'] = pd.DatetimeIndex(data['date']).year
        data['Month'] = pd.DatetimeIndex(data['date']).month
        data['Day'] = pd.DatetimeIndex(data['date']).day
        
        
        # Export: "Tweet id" + "Year" + "Month" + "Day"
        data_to_export = data[['Year','Month','Day']]
        name = 'data_dates/data_'+str(i)+'.csv'
        data_to_export.to_csv(name, header=False)
    
    
    # Move to the next chunk
    i += 1
    if i%10==0: print(i)