# <center>Comparing Constitutions</center>

In 2015, Nepal promulgated a new constitution, replacing an interim one that had been in force since 2008. The process of replacing the **Constitution of Nepal 1990** took almost 8 years.
In this exercise, I use simple techniques such as cosine distance and k-means clustering to see how far apart the current and the former constitutions of Nepal are. In the first step, I scrape constitutions from all around the world from **https://www.constituteproject.org/**. Then I consolidate all the text files into a single data frame before importing it into **R** and doing the text analysis.
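
To make the idea of cosine distance concrete, here is a minimal sketch in plain Python. It is only an illustration on toy strings, not the analysis itself (which is done in R). The function name `cosine_distance` and the whitespace tokenization are assumptions made for this example.

```python
import math
from collections import Counter

def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)

    # Euclidean length of each count vector
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for an empty text")

    return 1.0 - dot / (norm_a * norm_b)

# Texts with no words in common are as far apart as possible
print(cosine_distance("sovereign state", "federal republic"))  # 1.0

# Identical texts have a distance of (approximately) zero
print(cosine_distance("we the people", "we the people"))
```

Two documents that use the same words in the same proportions end up with a distance near 0, regardless of their length, which is why the measure is convenient for comparing constitutions of very different sizes.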

### <center>This Part Converts All the Text Files to One CSV Document</center>

In [1]:
#Import the necessary modules
import pandas as pd
import glob
import os
import re

In [2]:
#Get all the filenames with the extension .txt
os.chdir('C:/Users/Sushant/Desktop/Data_Science_R/Constitution_Comparison/consts')
files = glob.glob('*.txt')
files[0:10]

['Afghanistan 2004.txt',
 'Albania 1998 (rev. 2016).txt',
 'Algeria 1989 (reinst. 1996, rev. 2016).txt',
 'Andorra 1993.txt',
 'Angola 2010.txt',
 'Antigua and Barbuda 1981.txt',
 'Argentina 1853 (reinst. 1983, rev. 1994).txt',
 'Armenia 1995 (rev. 2015).txt',
 'Australia 1901 (rev. 1985).txt',
 'Austria 1920 (reinst. 1945, rev. 2013).txt']
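
The listing shows that each file name starts with the country name, followed by a four-digit year and sometimes a note in parentheses. As a small sketch of how those two fields can be pulled out of a single name, the helper below (`parse_filename` is a hypothetical name used only here) takes the first four-digit run as the year and everything before it as the country. It assumes that no country name itself contains four consecutive digits.

```python
import re

def parse_filename(filename):
    """Split a name like 'Albania 1998 (rev. 2016).txt' into (country, year)."""
    match = re.search(r'[0-9]{4}', filename)
    if match is None:
        raise ValueError('no four-digit year found in ' + repr(filename))

    # Text before the first year, without trailing whitespace, in lower case
    country = filename[:match.start()].rstrip().lower()
    return country, match.group()

print(parse_filename('Albania 1998 (rev. 2016).txt'))   # ('albania', '1998')
print(parse_filename('Antigua and Barbuda 1981.txt'))   # ('antigua and barbuda', '1981')
```

Only the first year is kept, so revision or reinstatement years in the parentheses are ignored.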

In [3]:
#Initialize necessary lists for appending data
country = []
year = []
content = []

for file in files:
    #The first four-digit number in the file name is the year
    date = re.findall('[0-9]{4}', file)
    year.append(date[0])

    #Everything before that year, minus trailing whitespace, is the country name
    name = re.split('[0-9]{4}', file)
    name = re.sub(r'\s+$', '', name[0])
    country.append(name.lower())

    #Read the file; the with-block closes it afterwards
    with open(file, 'r') as text_:
        text = text_.read()

    #If the constitution has a preamble, keep only the text after the
    #'PREAMBLE' heading; otherwise keep the whole text
    parts = text.split('PREAMBLE\n')
    no_intro = parts[1] if len(parts) > 1 else text

    #Substitute / by a space because it joins stopwords together, which
    #makes them difficult to remove later using the list of stopwords
    no_intro = re.sub('/', ' ', no_intro)
    no_intro = re.sub('[\n]', ' ', no_intro)

    #Remove the last 20 words from each constitution as they relate to copyright
    tokenized = no_intro.split(' ')
    no_intro = ' '.join(tokenized[:-20])
    content.append(no_intro.lower())
        
#Make a dictionary and convert it to a data frame
constitutions = {'Country': country,
                 'Year': year,
                 'Constitution': content}

consti_csv = pd.DataFrame(data=constitutions)
cols = ['Country', 'Year', 'Constitution']
consti_csv = consti_csv[cols]

### Check if everything went well

In [4]:
#Check to see if things are working
print(consti_csv.at[122, 'Country'])
print(consti_csv.at[122, 'Year'])

nepal old
1990
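
Besides spot-checking a row, the text-cleaning steps can be checked on a toy string. The sketch below is an illustration under stated assumptions: `clean_text` and its `n_trailing` parameter are hypothetical names introduced here, and the sample strings are made up. It follows the same steps as the loop: keep the text after `PREAMBLE` if present, replace `/` and newlines with spaces, drop the trailing words, and lower-case the result.

```python
import re

def clean_text(text, n_trailing=20):
    """Apply the cleaning steps to one constitution's raw text."""
    # Keep only what follows the 'PREAMBLE' heading, if there is one
    parts = text.split('PREAMBLE\n')
    body = parts[1] if len(parts) > 1 else text

    # Replace slashes and newlines with spaces
    body = re.sub('/', ' ', body)
    body = re.sub('\n', ' ', body)

    # Drop the last n_trailing space-separated tokens (the copyright notice)
    tokens = body.split(' ')
    keep = max(len(tokens) - n_trailing, 0)
    return ' '.join(tokens[:keep]).lower()

with_preamble = 'Cover page\nPREAMBLE\nWe, the People and/or Citizens resolve\ncopyright notice'
print(clean_text(with_preamble, n_trailing=2))
# we, the people and or citizens resolve

without_preamble = 'Article 1\nThe State/Nation is sovereign\ncopyright notice'
print(clean_text(without_preamble, n_trailing=2))
# article 1 the state nation is sovereign
```

The toy examples use `n_trailing=2` only to keep the strings short; the loop above removes 20 words.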


In [5]:
consti_csv.to_csv('Constitutions.csv')

In [6]:
consti_csv.head()

Unnamed: 0,Country,Year,Constitution
0,afghanistan,2004,"in the name of allah, the most beneficent, the..."
1,albania,1998,"we, the people of albania, proud and aware of ..."
2,algeria,1989,"the algerian people is a free people, decided ..."
3,andorra,1993,"the andorran people, with full liberty and ind..."
4,angola,2010,"we, the people of angola, through its lawful r..."


In [None]:
consti_csv.at[122, 'Constitution']