# INTRODUCTION

- In this document, I will walk through the basic process of cleaning and preparing data using Python.


- As data scientists, we always have to get our hands dirty when dealing with data; it rarely arrives "clean".


- So, take a seat as I take you through the process of data cleaning; I promise you will get the hang of it in no time.


- We'll perform data cleaning on a real-world dataset of artworks held by the Museum of Modern Art (MoMA). The dataset can be downloaded [here](https://raw.githubusercontent.com/Tess-hacker/Cleaning-and-Preparing-Data-in-Python/master/artworks.csv).


- The dataset includes a header row that names each column. The columns are described below:

#### INFORMATION ON THE COLUMNS CONTAINED IN THE DATA SET
     Title: The title of the artwork.
     Artist: The name of the artist who created the artwork.
     Nationality: The nationality of the artist.
     BeginDate: The year in which the artist was born.
     EndDate: The year in which the artist died.
     Gender: The gender of the artist.
     Date: The date that the artwork was created.
     Department: The department inside MoMA to which the artwork belongs.
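
Because the rows are later accessed by position (for example `row[2]` for Nationality and `row[5]` for Gender), it can help to name those positions once. A minimal sketch, assuming the columns appear in the order listed above; `COLUMNS`, `INDEX`, and `col` are illustrative names, and the sample row is made up:

```python
# Column names in the order listed above (assumed to match the CSV header).
COLUMNS = ["Title", "Artist", "Nationality", "BeginDate",
           "EndDate", "Gender", "Date", "Department"]

# Map each column name to its position in a row.
INDEX = {name: position for position, name in enumerate(COLUMNS)}


def col(row, name):
    """Return the value stored under the named column of one row."""
    return row[INDEX[name]]


# Hypothetical row, for illustration only (not taken from the dataset).
sample_row = ["Untitled", "Jane Doe", "(American)", "(1930)",
              "(2001)", "(Female)", "1965", "Prints"]
print(INDEX["Nationality"], INDEX["Gender"], INDEX["Date"])  # 2 5 6
print(col(sample_row, "Gender"))                             # (Female)
```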

## READING AND IMPORTING THE FILE 

- The first step is to load the dataset into our code. We can do this with the `reader` function from Python's `csv` module:


- Remember to use the path where the dataset is located on your computer so that you can easily import the data.

In [11]:
from csv import reader

# Replace this path with the location of artworks.csv on your computer.
opened_file = open(r'C:\Users\USER\Documents\ONLINE COURSES\DATAQUEST\artworks.csv', 'r', encoding='utf-8')
read_file = reader(opened_file)
moma = list(read_file)
opened_file.close()
moma = moma[1:]  # slice from index 1 to drop the header row
# print(moma)
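
The same read can be written with a `with` block, which closes the file automatically once the rows are loaded. A minimal sketch using only the standard library; `load_rows` is an illustrative helper, and the small temporary file with made-up rows stands in for `artworks.csv` so that the example runs anywhere:

```python
import csv
import os
import tempfile


def load_rows(path):
    """Read a CSV file and return (header, rows) as lists of strings."""
    # newline="" is what the csv module documentation recommends for csv.reader.
    with open(path, "r", encoding="utf-8", newline="") as opened_file:
        rows = list(csv.reader(opened_file))
    return rows[0], rows[1:]


# Stand-in data so the sketch is self-contained (made-up rows, not from MoMA).
sample = "Title,Artist,Nationality\nUntitled,Jane Doe,(American)\nStudy,John Roe,\n"
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    tmp.write(sample)

header, rows = load_rows(tmp.name)
os.remove(tmp.name)  # clean up the temporary file

print(header)     # ['Title', 'Artist', 'Nationality']
print(len(rows))  # 2
```

Returning the header separately keeps the column names available without leaving them mixed in with the data rows.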

In [8]:
# Using the link provided above, you can load the file directly with pandas:
import pandas as pd

url = 'https://raw.githubusercontent.com/Tess-hacker/Cleaning-and-Preparing-Data-in-Python/master/artworks.csv'
# header=None keeps the header as row 0, so we slice it off when printing.
# on_bad_lines='skip' (pandas 1.3 or later) skips malformed rows instead of raising an error.
moma_df = pd.read_csv(url, header=None, on_bad_lines='skip')
print(moma_df[1:])

                                                       0                   1  \
1                         Dress MacLeod from Tartan Sets  Sarah Charlesworth   
2      Duplicate of plate from folio 11 verso (supple...     Pablo Palazuelo   
3                       Tailpiece (page 55) from SAGESSE       Maurice Denis   
4      Headpiece (page 129) from LIVRET DE FOLASTRIES...    Aristide Maillol   
5                                          97 rue du Bac        Eugène Atget   
...                                                  ...                 ...   
16721                                   Oval with Points         Henry Moore   
16722    Cementerio de la Ciudad Abierta, Ritoque, Chile         Juan Baixas   
16723                                        The Catboat       Edward Hopper   
16724  Dognat' i peregnat' v tekhniko-ekonomicheskom ...             Unknown   
16725                      Plate (page 11) from The Dive           Alex Katz   

                2     3     4         5

## REPLACING SUBSTRINGS WITH THE 'REPLACE' METHOD


- If a string contains words or letters that we want to replace, we can do that using the `str.replace()` method. It returns a copy of the string in which every occurrence of the given word, phrase, or letter is replaced with the new value. For instance:

In [None]:
fav_color = "red is my favorite color"
fav_color = fav_color.replace('red', 'Black')
print(fav_color)

- When we try to replace an individual letter within a string, this is what we get:

In [None]:
fave_color = "red and pink are my favorite colors"
fave_color = fave_color.replace('r', 'R')
print (fave_color)

- Before we practice using `str.replace()` to replace a substring, let's quickly recap a few important things:
    - Parts of strings are called substrings.
    - We can use the `str.replace()` method to find and replace substrings.
    - `str.replace()` requires two arguments:
        - `old`: The substring we want to find and replace.
        - `new`: The substring we want to replace `old` with.
    - When we use `str.replace()`, we substitute the variable name of the string we want to modify for `str`.
    - `str.replace()` returns a new string, so we need to use `=` to assign the result to a variable name.

In [None]:
age1 = "I am thirty-one years old"
age2 = age1.replace('one','two')
print (age2)
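
Three properties of `str.replace()` are worth checking before applying it to the dataset: it returns a new string and leaves the original untouched, it replaces every occurrence unless an optional third argument limits the count, and it is case-sensitive. A short sketch reusing the string from the cell above:

```python
fave_color = "red and pink are my favorite colors"

# replace() returns a new string; the original is not modified.
all_replaced = fave_color.replace("r", "R")
print(fave_color)    # red and pink are my favorite colors
print(all_replaced)  # Red and pink aRe my favoRite coloRs

# The optional third argument caps the number of replacements.
first_only = fave_color.replace("r", "R", 1)
print(first_only)    # Red and pink are my favorite colors

# Matching is case-sensitive: there is no uppercase "RED" to find here.
unchanged = fave_color.replace("RED", "blue")
print(unchanged == fave_color)  # True
```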

## CLEANING THE GENDER AND NATIONALITY COLUMNS

- In the dataset, the values in the Nationality and Gender columns contain parentheses. We want to remove these parentheses, and we can do so with the `str.replace()` method.

- Also, some rows have empty values for gender and nationality, and we want to replace them with a string that says so. The code below therefore removes the parentheses, converts every value to title case, and replaces empty values with a descriptive string.

In [None]:
for row in moma:  # 'moma' is the list of rows built with csv.reader in the first approach
    nationality = row[2]
    nationality = nationality.replace('(', '')
    nationality = nationality.replace(')', '')
    row[2] = nationality
    print('The adjusted nationality is:')
    print(nationality)
for row in moma:
    gender = row[5]
    gender = gender.replace('(', '')
    gender = gender.replace(')', '')
    row[5] = gender
    print('The new and adjusted gender is:')
    print(gender)
    
for row in moma:
    gender = row[5]
    # convert the gender to title case
    gender = gender.title()
    # if there is no gender, set
    # a descriptive value
    if not gender:
        gender = "Gender Unknown/Other"
    row[5] = gender

for row in moma:
    nationality = row[2]
    nationality = nationality.title()
    if not nationality:
        nationality = "Nationality Unknown"
    row[2] = nationality
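
The four loops above repeat the same three steps for each column: strip the parentheses, convert to title case, and fall back to a descriptive label when the value is empty. A minimal sketch that folds those steps into one helper; `clean_label` is an illustrative name and the sample rows are made up:

```python
def clean_label(value, default):
    """Strip parentheses, title-case the text, and use a default for empty values."""
    value = value.replace("(", "").replace(")", "")
    value = value.title()
    return value if value else default


# Made-up rows in the same column layout (index 2 = Nationality, index 5 = Gender).
sample_rows = [
    ["Untitled", "Jane Doe", "(american)", "(1930)", "(2001)", "(female)", "1965", "Prints"],
    ["Study", "Unknown", "", "", "", "", "1920", "Drawings"],
]

for row in sample_rows:
    row[2] = clean_label(row[2], "Nationality Unknown")
    row[5] = clean_label(row[5], "Gender Unknown/Other")

print(sample_rows[0][2], sample_rows[0][5])  # American Female
print(sample_rows[1][2])                     # Nationality Unknown
print(sample_rows[1][5])                     # Gender Unknown/Other
```

Keeping the steps in one function means each column is visited once, and a change to the cleaning rules only has to be made in one place.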

## FURTHER DATA CLEANING


- We have cleaned the data related to gender and nationality. We would now like to do the same with the BeginDate and EndDate columns. For details on what these columns contain, see the column descriptions in the introduction above.


- Just like the Gender and Nationality columns, the BeginDate and EndDate columns contain parentheses and are stored as strings. We need to remove the parentheses and convert the values to integers.


- We could remove these characters by repeating the `str.replace()` calls from the previous cell for each column. However, this becomes tedious, especially when we are dealing with a large dataset. So, let us write a function for this purpose.

In [None]:
def adjust_date(date):
    # leave empty strings untouched so that int() does not raise an error
    if date != "":
        date = date.replace("(", "")
        date = date.replace(")", "")
        date = int(date)
    return date


for row in moma:  # 'moma' is the list of rows built with csv.reader in the first approach
    BeginDate = row[3]
    EndDate = row[4]
    adjusted_begindate = adjust_date(BeginDate)
    adjusted_enddate = adjust_date(EndDate)
    print('The adjusted begindate is:')
    print(adjusted_begindate)
    print('\n')
    print('The adjusted enddate is:')
    print(adjusted_enddate)
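
`int()` raises a `ValueError` when the cleaned text is not a whole number, so a single unexpected value would stop the loop. A defensive variant, sketched under the assumption that such values should simply be marked as missing; `parse_year` is an illustrative name, and returning `None` for unusable values is a design choice rather than something the dataset requires:

```python
def parse_year(text):
    """Return the year as an int, or None when the text holds no usable number."""
    cleaned = text.replace("(", "").replace(")", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        # Covers empty strings and anything else int() cannot parse.
        return None


print(parse_year("(1930)"))  # 1930
print(parse_year(""))        # None
print(parse_year("(n.d.)"))  # None
```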
    
    

- We have iterated through and cleaned the birth and death years of the artists in the dataset. Next, we want to clean the dates on which the artworks themselves were created. The *Date* column is the seventh column in the dataset (index 6), and we need to remove some unwanted characters from it to achieve our purpose.


- Unlike the previous columns, the Date column contains many different unwanted characters, and it would be inefficient to write a separate `str.replace()` call for each one. Thus, just like in the previous cell, we will write a function that encapsulates this goal.

In [13]:
bad_chars = ["(", ")", "c", "C", ".", "s", "'", " "]  # every unwanted character found in the Date column


def delete_characters(string):
    # remove each unwanted character in turn
    for char in bad_chars:
        string = string.replace(char, "")
    return string


dates = []  # an empty list where the cleaned dates will be stored
for row in moma:
    date = row[6]
    adjust = delete_characters(date)
    dates.append(adjust)
print(dates[:10])  # print once, after the loop has finished

['1986', '1978', '1889-1911', '1927-1940', '1903', '1957', '1924', '1978-1983', '2001', '1941']
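
When every unwanted item is a single character, `str.translate()` offers an alternative to looping over `str.replace()`: it removes all of them in one pass. This is a sketch of that alternative, not the approach used in the cell above; `strip_bad_chars` and the sample strings are illustrative:

```python
BAD_CHARS = "()cC.s' "

# With two empty strings, the third argument of str.maketrans lists characters to delete.
DELETE_TABLE = str.maketrans("", "", BAD_CHARS)


def strip_bad_chars(text):
    """Remove every character listed in BAD_CHARS from the text."""
    return text.translate(DELETE_TABLE)


print(strip_bad_chars("(c. 1917-1920)"))  # 1917-1920
print(strip_bad_chars("1970s"))           # 1970
print(strip_bad_chars("1986"))            # 1986
```

Because letters such as `c` and `s` are removed wherever they occur, either approach only suits a column that should end up holding nothing but digits and hyphens.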

- Finally, we will put a finishing touch on our dates. Having removed all the unwanted characters, we want to deal with the dates that are stated as ranges.


- Some of the dates are single years while others are given as ranges. This means we have two different kinds of dates present. To address this issue, we can convert our date ranges into averages. To do this, we can use Python's `str.split()` method, which is called on a string in the same way as `str.replace()` and returns a list of the pieces found on either side of the separator.


- We cannot simply apply it to every value in the column, as this would affect the dates that are not ranges. Thus, we need to write a function that does the following for us:

    - Helps us identify the dates stated in ranges
    
    - Splits the dates into individual dates
    
    - Finds the average between the two dates and gives a rounded figure using the `round()` function
    
    - Skips the dates that are not printed in ranges
    
    - Returns every date (single or averaged) as an integer using the `int()` function

In [14]:
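def process_date(date):
    if "-" in date:
        # a range such as '1889-1911': average the two years and round the result
        split_date = date.split("-")
        date_one = split_date[0]
        date_two = split_date[1]
        date = (int(date_one) + int(date_two)) / 2
        date = round(date)
    elif date.isdigit():
        # a single year: convert it to an integer
        date = int(date)
    # anything else (for example an empty string) is returned unchanged
    return date


for row in moma:
    date = row[6]
    adjust = delete_characters(date)  # delete_characters is defined in the previous cell
    date = process_date(adjust)
    print(date)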

1986
1978
1900
1934
1903
1957
1924
1980
2001
1941
...
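
One detail explains some of the averages in the output: in Python 3, `round()` rounds a value that ends in exactly .5 to the nearest even integer, so the range 1978-1983 (mean 1980.5) becomes 1980 while 1927-1940 (mean 1933.5) becomes 1934. A small self-contained sketch; `average_range` is an illustrative name:

```python
def average_range(date):
    """Return the rounded midpoint of a 'start-end' year range as an int."""
    start, end = date.split("-")
    return round((int(start) + int(end)) / 2)


print(average_range("1889-1911"))  # 1900
print(average_range("1927-1940"))  # 1934
print(average_range("1978-1983"))  # 1980
```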


## CONCLUSION


- With this cleaned dataset, we can proceed with our analysis. In the next repository, titled **Python Data Analysis Basics**, I will take you through how we can use this processed dataset to conduct data analysis and format text data.


- In the meantime, practice these steps and see what you can achieve. If you have better ways to approach any of them, please let me know, as I would be thrilled to hear from you.


- Till then, **HAPPY CODING!!!**