# English Literature Analysis

## Data Preparation / Data Cleaning

### Tasks

1. Read all the sheets in the input Excel file into a list of dataframes.
2. Standardize the column names.
3. Match the names in different sheets and update corresponding addresses.
   * Mark conflicts in a separate column.
4. Update the latitudes and longitudes of entries with the same address.
5. Find the missing latitudes and longitudes for the addresses using a geocoding API.
6. Remove/merge unnecessary columns and create the final dataframe.
7. Write the final dataframe to an Excel file.

### Libraries

In [34]:
import os.path
import pandas as pd
import numpy as np
from geopy.geocoders import Nominatim # Required to find latitudes and longitudes of places
from pandas import ExcelWriter

### Important constants

In [35]:
INPUT_FILE_NAME = 'data/Network Data Lists-7-14.xlsx'
OUTPUT_FILE_NAME = 'data/Final_dataset.xlsx'
INPUT_COLUMNS = ['Name', 'List_ID', 'Year', 'Reformatted_Name','Gender', 'is_from_London', 'London_Street',
                 'London_Region', 'London_Lat_Long', 'City','Non_London_City_Lat_Long', 'Nation']
OUTPUT_COLUMNS = ['Name', 'Reformatted_Name', 'List_ID', 'Year','Gender', 'is_from_London', 'London_Street',
                 'London_Region', 'City', 'Nation', 'Lat_Long']
SHEETS = None

### 1. Read all the sheets in the input Excel file into a list of dataframes.

In [36]:
# Returns a list of dataframes, one per sheet
def read_input():
    global SHEETS
    xl = pd.ExcelFile(INPUT_FILE_NAME)
    SHEETS = xl.sheet_names[:-1]        # Drop the last sheet (Project Details)
    print("Sheets in dataframe")
    print(SHEETS)
    return [xl.parse(sheet) for sheet in SHEETS]

### 2. Standardize the column names.

In [37]:
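def standardize_column_names(df_list):
    # Rename the leading columns by position to INPUT_COLUMNS;
    # any extra trailing columns keep their original names
    for df in df_list:
        df.columns = INPUT_COLUMNS + list(df.columns)[len(INPUT_COLUMNS):]
    return df_list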

# Utility function to check column names
def check_column_names(df_list):
    for df in df_list:
        print(df.columns[:len(INPUT_COLUMNS)])

### 3. Match the names in different sheets and update corresponding addresses.
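
This step has no code in the notebook yet. The following is a minimal sketch of one way to approach it, not the project's actual method. It assumes that `Reformatted_Name` identifies a person across sheets, that the address lives in the `London_Street`, `London_Region`, `City` and `Nation` columns, and that names match exactly. The `Address_Conflict` column and the temporary `_sheet` column are hypothetical names introduced for illustration.

```python
import pandas as pd

# Assumption: these columns together describe an address
ADDRESS_COLUMNS = ['London_Street', 'London_Region', 'City', 'Nation']

def match_names_and_update_addresses(df_list, key='Reformatted_Name'):
    # Stack all sheets, remembering which sheet each row came from
    combined = pd.concat(
        [df.assign(_sheet=i) for i, df in enumerate(df_list)],
        ignore_index=True,
    )
    combined['Address_Conflict'] = False  # hypothetical conflict marker
    for col in ADDRESS_COLUMNS:
        by_name = combined.groupby(key)[col]
        n_distinct = by_name.transform('nunique')  # missing values are not counted
        first_known = by_name.transform('first')   # first non-missing value per name
        # Fill a missing value only when the other sheets agree on a single value
        fill = combined[col].isna() & (n_distinct == 1)
        combined.loc[fill, col] = first_known[fill]
        # More than one distinct value for the same name is marked as a conflict
        combined.loc[n_distinct > 1, 'Address_Conflict'] = True
    # Split the stacked rows back into one dataframe per sheet
    return [
        combined[combined['_sheet'] == i].drop(columns='_sheet').reset_index(drop=True)
        for i in range(len(df_list))
    ]
```

Conflicting rows are left unchanged and only flagged, so they can be reviewed by hand. Two different people who share a name would also be flagged, which is one reason to treat this as a starting point only.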

### 4. Update the latitudes and longitudes of entries with the same address.
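
No code exists for this step yet. As a rough sketch, coordinates that are already known for an address could be copied to other rows with the same address. The function name `share_coordinates` is hypothetical, and the sketch assumes that addresses are compared as exact strings in `London_Street` and that coordinates are stored in `London_Lat_Long`.

```python
import pandas as pd

def share_coordinates(df, address_col='London_Street', coord_col='London_Lat_Long'):
    # Work on a copy so the input dataframe is left untouched
    df = df.copy()
    # First non-missing coordinate recorded for each address
    known = df.groupby(address_col)[coord_col].transform('first')
    # Only rows without a coordinate are updated
    df[coord_col] = df[coord_col].fillna(known)
    return df
```

Rows with a missing address are not grouped and therefore keep their missing coordinates. Running this before any geocoding would reduce the number of API requests needed later.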

### 5. Find the missing latitudes and longitudes for the addresses using a geocoding API.
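
This step is not implemented in the notebook. Since `Nominatim` from geopy is already imported above, a possible sketch is shown here. Several things are assumptions: the function names, the `user_agent` value (a placeholder), the query built from `London_Street`, `City` and `Nation`, and the `"lat, long"` string format written into `London_Lat_Long`. The public Nominatim server asks for at most one request per second, so the geocoder is wrapped in geopy's `RateLimiter` and repeated queries are cached.

```python
import pandas as pd

def make_nominatim_geocoder(user_agent='english-literature-analysis'):
    # Imported here so the rest of the sketch can be tried offline with a stand-in geocoder
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter
    geolocator = Nominatim(user_agent=user_agent)  # placeholder user agent
    # Wait at least one second between requests
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)

def fill_missing_coordinates(df, geocode, coord_col='London_Lat_Long'):
    df = df.copy()
    df[coord_col] = df[coord_col].astype(object)  # allow string coordinates
    cache = {}  # query string -> "lat, long" string, or None if not found
    for idx, row in df[df[coord_col].isna()].iterrows():
        # Assumption: street, city and nation together form a usable query
        parts = [row[c] for c in ('London_Street', 'City', 'Nation') if pd.notna(row[c])]
        if not parts:
            continue
        query = ', '.join(str(p) for p in parts)
        if query not in cache:
            location = geocode(query)  # returns None when nothing is found
            cache[query] = (None if location is None
                            else f'{location.latitude}, {location.longitude}')
        if cache[query] is not None:
            df.at[idx, coord_col] = cache[query]
    return df
```

It could be called as `df = fill_missing_coordinates(df, make_nominatim_geocoder())`. Addresses that the service cannot resolve stay missing rather than being guessed.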

### 6. Remove/merge unnecessary columns and create the final dataframe.
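
This step is also still empty. Comparing `INPUT_COLUMNS` with `OUTPUT_COLUMNS` suggests that `London_Lat_Long` and `Non_London_City_Lat_Long` are meant to be merged into a single `Lat_Long` column. A minimal sketch under that reading follows. The function name is hypothetical, and giving the London coordinate precedence when both are present is an assumption.

```python
import pandas as pd

# Same list as in the constants cell above
OUTPUT_COLUMNS = ['Name', 'Reformatted_Name', 'List_ID', 'Year', 'Gender', 'is_from_London',
                  'London_Street', 'London_Region', 'City', 'Nation', 'Lat_Long']

def create_final_dataframe(df):
    df = df.copy()
    # Use the London coordinate where present, otherwise the non-London one
    df['Lat_Long'] = df['London_Lat_Long'].combine_first(df['Non_London_City_Lat_Long'])
    # Keep only the output columns, in the output order
    return df[OUTPUT_COLUMNS]
```

Applied to every sheet, this could look like `df_list = [create_final_dataframe(df) for df in df_list]`. Any extra trailing columns in a sheet are dropped by the final selection.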

### 7. Write the final dataframe to an Excel file.

In [38]:
def save_xls(df_list):
    # The context manager saves and closes the workbook on exit;
    # ExcelWriter.save() was removed in pandas 2.0
    with ExcelWriter(OUTPUT_FILE_NAME) as writer:
        for sheet, df in zip(SHEETS, df_list):
            df.to_excel(writer, sheet_name=sheet)

## Client

In [39]:
df_list = read_input()

Sheets in dataframe
['Emma Lyon 1812', 'Bolaffey 1820', 'Polack 1830', ' Moss 1839', ' Infant 1841', ' VOJ 1842-45', 'Belisario 1856']


In [40]:
df_list = standardize_column_names(df_list)
#check_column_names(df_list)

In [41]:
save_xls(df_list)