## Name: Mehreen Habib
## Email: mehreen.habib-1@ou.edu

# Project 3: The Smart City Slicker

Imagine you are a stakeholder in a rising Smart City and want to know more about the themes and concepts of existing smart cities. You also want to know where your smart city stands among the others. In this project, you will perform 
exploratory data analysis, often shortened to EDA, to examine data from the [2015 Smart City Challenge](https://www.transportation.gov/smartcity), find facts about the data, and communicate those facts through text analysis and visualizations.

In order to explore the data and visualize it, some modifications might need to be made to the data along the way. This is often referred to as data preprocessing or cleaning.
Though data preprocessing is technically different from EDA, EDA often exposes problems with the data that need to be fixed in order to continue exploring.
Because of this tight coupling, you have to clean the data as necessary to help understand the data.

In this project, you will apply your knowledge about data cleaning, machine learning, visualizations, and databases to explore smart city applications.

**Part 1** of the notebook will explore and clean the data. \
**Part 2** will take the results of the preprocessed data to create models and visualizations.

Empty cells are code cells. 
Cells denoted with [Your Answer Here] are markdown cells.
Edit and add as many cells as needed.

Output file for this notebook is shown as a table for display purposes. Note: The city name can be Norman, OK or OK Norman.

| city | raw text | clean text | clusterid | topicids | 
| -- | -- | -- | -- | -- | 
|Norman, OK | Test, test , and testing. | test test test | 0 | T1, T2| 
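
As a rough illustration of the output layout in the table above, the sketch below writes one row with the five columns to a tab-separated file and reads it back. The file name `smartcity_output.tsv`, the tab delimiter, and the helper `write_rows` are assumptions made for this example; the project text does not fix them.

```python
import csv
import os
import tempfile

COLUMNS = ["city", "raw text", "clean text", "clusterid", "topicids"]


def write_rows(path, rows):
    """Write dict rows to a tab-separated file using the header from the table."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)


example_row = {
    "city": "Norman, OK",
    "raw text": "Test, test , and testing.",
    "clean text": "test test test",
    "clusterid": 0,
    "topicids": "T1, T2",
}

# Hypothetical output location; a temporary directory avoids touching the project folder.
out_path = os.path.join(tempfile.mkdtemp(), "smartcity_output.tsv")
write_rows(out_path, [example_row])

# Read the file back to confirm the layout.
with open(out_path, encoding="utf-8", newline="") as handle:
    written = list(csv.DictReader(handle, delimiter="\t"))
```

A tab delimiter is used here only because the city and topic columns contain commas; any delimiter with proper quoting would work as well.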

## Introduction
The Dataset: 2015 Smart City Challenge Applicants (non-finalist).
In this project you will use the applicants' PDFs as a dataset.
The dataset is from the U.S. Department of Transportation Smart City Challenge.

On the website page for the data, you can find some basic information about the challenge. This is an interesting dataset. Think of the questions that you might be able to answer! A few could be:

1. Can I identify frequently occurring words that could be removed during data preprocessing?
2. Where are the applicants from?
3. Are there multiple entries for the same city in different applications?
4. What are the major themes and concepts from the smart city applicants?

Let's load the data!

## Loading and Handling files (Required)

Load data from `smartcity/`. 

To extract the data from the PDF files, use the [pypdf.PdfReader](https://pypdf.readthedocs.io/en/stable/index.html) class.
It lets you read the pages of each PDF file and add their text to a data structure (dataframe, list, dictionary, etc).
To install the module, use the command `pipenv install pypdf`.
You only need to handle PDF files, handling docx is not necessary.

In [1]:
!pip install PyPDF2



In [4]:

!pip install --upgrade PyPDF2


Collecting PyPDF2
  Using cached pypdf2-3.0.1-py3-none-any.whl (232 kB)
Collecting typing_extensions>=3.10.0.0 (from PyPDF2)
  Using cached typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Installing collected packages: typing_extensions, PyPDF2
  Attempting uninstall: PyPDF2
    Found existing installation: PyPDF2 1.26.0
    Uninstalling PyPDF2-1.26.0:
      Successfully uninstalled PyPDF2-1.26.0
Successfully installed PyPDF2-3.0.1 typing_extensions-4.5.0


In [5]:
import os

from PyPDF2 import PdfReader  # PyPDF2 3.x name; PdfFileReader was removed in 3.0

folder_path = 'smartcity/'
data = []

for filename in os.listdir(folder_path):
    if filename.endswith(".pdf"):
        filename_root, filename_ext = os.path.splitext(filename)
        # Clean up file names that contain '_0' or end in 'AZ'
        if '_0' in filename_root or filename_root.endswith('AZ'):
            filename_root = filename_root.replace('_0', '').replace('AZ', '')
        # File names start with the two-letter state code, followed by the city name
        state = filename_root[:2]
        city = filename_root[3:]
        city_name = city + "," + state
        with open(os.path.join(folder_path, filename), "rb") as f:
            pdf_reader = PdfReader(f)
            page_text = ""
            for page in pdf_reader.pages:
                # Extract the text of the page
                page_text += page.extract_text()

            data.append((city_name, page_text))


Create a data structure to add the city name and raw text. You can choose to split the city name from the file.
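
To make the city-name split concrete, here is a minimal sketch that turns a file name into a `(city, state)` pair. It assumes file names follow a `"<STATE> <City>.pdf"` pattern with an optional numeric suffix such as `_0`; the helper `split_city_state` and the sample names are hypothetical.

```python
import os
import re


def split_city_state(filename):
    """Return (city, state) from a name such as 'OK Norman.pdf' (assumed pattern)."""
    root, _ext = os.path.splitext(filename)
    # Drop a trailing duplicate marker such as '_0' if present.
    root = re.sub(r"_\d+$", "", root).strip()
    # Everything before the first space is treated as the state code.
    state, _, city = root.partition(" ")
    return city.strip(), state.upper()


city, state = split_city_state("OK Norman_0.pdf")
label = f"{city}, {state}"
```

Anchoring the suffix pattern to the end of the name keeps it from altering city names that happen to contain the same characters elsewhere.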

In [6]:
import pandas as pd
df = pd.DataFrame(data, columns=['city', 'raw text'])
pd.set_option('display.max_rows', df.shape[0]+1)
display(df)

Unnamed: 0,city,raw text
0,"Port Huron and Marysville,MI",\n1\n \n \n1)\n \nVision for the Port Huron/M...
1,"Las Vegas,NV",Part 1 Œ Vision NarrativeAttachments1 Smart Ci...
2,"Seattle,WA",Beyond Traf˜c: USDOT Smart City Challenge\nApp...
3,"Chula Vista,CA",OFFICE OF THE MAYORMary Casillas SalasFebruary...
4,"Birmingham,AL",BirminghamRisingBirmingham Rising! Meeting the...
5,"Fresno,CA",U.S. Department of Transportation \nNotice of ...
6,"Long Beach,CA","Grant Proposal USDOT SMART CITIESFebruary 4, 2..."
7,"Newark,NJ",TABLE OF CONTENTS\n \nExecutive Summary\n \n.....
8,"Oklahoma City,OK",U.S. Department of Transportation\nNotice of F...
9,"Nashville,TN",!!!!!!USDOT: Smart City Challenge\n ÒNashville...


## Cleaning Up PDFs (Required)

One of the more frustrating aspects of PDF is loading the data into a readable format. The first order of business will be to preprocess the data. To start, you can use code provided by Text Analytics with Python, [Chapter 3](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch03%20-%20Processing%20and%20Understanding%20Text/Ch03a%20-%20Text%20Wrangling.ipynb): [contractions.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/contractions.py) (Pages 136-137), and [text_normalizer.py](https://github.com/dipanjanS/text-analytics-with-python/blob/master/New-Second-Edition/Ch05%20-%20Text%20Classification/text_normalizer.py) (Pages 155-156). Feel free to download the scripts or add the code directly to the notebook (please note this code is performed on dataframes).

In addition to the data cleaning provided by the textbook, you will need to:
1. Consider removing terms that may affect clustering and topic modeling. Words to consider are cities, states, common words (smart, city, page, etc.). Keep in mind n-gram combinations are important; this can also be revisited later depending on your model's performance.
2. Check the data to remove applicants whose text was not processed correctly. Do not remove more than 15 cities from the data.
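
One possible way to spot applicants whose text was not processed correctly is to flag documents that are nearly empty or mostly non-alphabetic. The sketch below is only a heuristic: the function name `looks_unreadable`, both thresholds, and the sample texts are assumptions, and flagged documents would still need a manual look.

```python
def looks_unreadable(text, min_chars=500, min_alpha_ratio=0.6):
    """Flag text that is too short or mostly non-alphabetic (thresholds are guesses)."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return True
    # Count letters and whitespace; digits and symbols lower the ratio.
    alpha = sum(ch.isalpha() or ch.isspace() for ch in stripped)
    return alpha / len(stripped) < min_alpha_ratio


# Hypothetical documents for illustration only.
samples = {
    "Empty City, XX": "",
    "Garbled City, XX": "^~ 73g h25 d03 6i2 8jk9 977 " * 40,
    "Readable City, XX": "The city will deploy connected vehicle technology. " * 20,
}
flagged = [city for city, text in samples.items() if looks_unreadable(text)]
```

In a dataframe this check could be applied to the raw text column to produce a candidate list for removal, keeping the 15-city limit in mind.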


In [7]:
df = df.drop([12, 32, 37,  42, 46, 60, 64, 67])
display(df)

Unnamed: 0,city,raw text
0,"Port Huron and Marysville,MI",\n1\n \n \n1)\n \nVision for the Port Huron/M...
1,"Las Vegas,NV",Part 1 Œ Vision NarrativeAttachments1 Smart Ci...
2,"Seattle,WA",Beyond Traf˜c: USDOT Smart City Challenge\nApp...
3,"Chula Vista,CA",OFFICE OF THE MAYORMary Casillas SalasFebruary...
4,"Birmingham,AL",BirminghamRisingBirmingham Rising! Meeting the...
5,"Fresno,CA",U.S. Department of Transportation \nNotice of ...
6,"Long Beach,CA","Grant Proposal USDOT SMART CITIESFebruary 4, 2..."
7,"Newark,NJ",TABLE OF CONTENTS\n \nExecutive Summary\n \n.....
8,"Oklahoma City,OK",U.S. Department of Transportation\nNotice of F...
9,"Nashville,TN",!!!!!!USDOT: Smart City Challenge\n ÒNashville...


In [8]:
!pip install nltk



In [9]:
import re
import nltk 
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
nltk.download('omw-1.4')


[nltk_data] Downloading package stopwords to
[nltk_data]     /home/habibmehreen1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/habibmehreen1/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [10]:
city_seps = []
for city in df['city']:
    city_sep = city.split(',')[0].lower()
    city_seps.append(city_sep)

In [11]:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
nltk.download('wordnet')
stop_words = set(stopwords.words('english'))



[nltk_data] Downloading package stopwords to
[nltk_data]     /home/habibmehreen1/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /home/habibmehreen1/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [12]:
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
common_words = set(stopwords.words('english')) | {'smart', 'page', 'city', 'challenge', 'contents', 'birminghamrisingbirmingham', 'usdot', 'depatment', 'submitted', 'application', 'beyond', 'traffic', 'proposal', 'dtfh6116ra00002', '^', '~', 'ccb', 'rfy', 'dtfhra', 'rfyc', 'rfyf', 'nbe', 'rfy', 'rfy', 'rfyb', 'frfr', 'rfyr', 'rfy', 'rfygf', 'february', '2016', 'birminghamrising', 'mmmmmmmmmyonkersmount ', 'iii10', 'dtfh6116ra00002', '2016', '012', '345', '777', '73g', 'h25', 'd03', '6i2', '8jk9', '977', '75a', 'a77', '6l5d',  '7775', '722', '7777', 'f77', '777', '123',  'd75', '77h', 'yorkcitypelhampelhammanorbronxvilleeastchester879595'} | set(city_seps)
   

#### Add the cleaned text to the structure you created.


In [13]:
# Lowercase, then strip punctuation, symbols, and digits
df['clean text'] = df['raw text'].apply(lambda x: x.lower())
df['clean text'] = df['clean text'].apply(lambda x: re.sub(r'[^a-zA-Z0-9\s]',  '', x))
df['clean text'] = df['clean text'].apply(lambda x: re.sub(r'\d+', '', x))
# NOTE: plain str.replace also removes 'city' inside longer words (e.g. 'capacity' -> 'capa')
df['clean text'] = df['clean text'].str.replace('city', '')
# Tokenize, drop stop words and common terms, then lemmatize tokens longer than two characters
df['clean text'] = df['clean text'].apply(lambda x: word_tokenize(x))
df['clean text'] = df['clean text'].apply(lambda x: [token for token in x if token not in common_words])
df['clean text'] = df['clean text'].apply(lambda x: [lemmatizer.lemmatize(token) for token in x if len(token) > 2])
df['clean text'] = df['clean text'].apply(lambda x:  ' ' .join(x))
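
Token-level filtering cannot match multi-word names such as "las vegas", and a plain substring replace can cut into longer words. As an alternative sketch, a word-boundary regex removes whole words and phrases only. The helper `remove_terms` and the sample sentence are hypothetical and are not part of the pipeline above.

```python
import re


def remove_terms(text, terms):
    """Remove whole words and multi-word phrases, leaving longer words untouched."""
    # Longest terms first so 'oklahoma city' is matched before 'city'.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b"
    cleaned = re.sub(pattern, " ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(cleaned.split())


sample = "smart city capacity in las vegas and oklahoma city"
result = remove_terms(sample, {"smart", "city", "las vegas", "oklahoma city"})
substring_result = "capacity".replace("city", "")
```

Here `result` keeps "capacity" intact, while `substring_result` shows what a substring replace does to the same word.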
 

In [14]:
pd.set_option('display.max_colwidth', 100)
display(df)

Unnamed: 0,city,raw text,clean text
0,"Port Huron and Marysville,MI",\n1\n \n \n1)\n \nVision for the Port Huron/Marysville Smart Cities Collaborative\n \n \nIntrod...,vision port huronmarysville city collaborative introduction city port huron marysville michigan ...
1,"Las Vegas,NV",Part 1 Œ Vision NarrativeAttachments1 Smart City Vision ...........................................,part vision narrativeattachments vision la vega aligns population characteristic la vega aligns ...
2,"Seattle,WA",Beyond Traf˜c: USDOT Smart City Challenge\nApplication prepared by Seattle Department of Transpo...,trafc prepared department transportation partnership seattlea prototype new century digital tabl...
3,"Chula Vista,CA","OFFICE OF THE MAYORMary Casillas SalasFebruary 2, 2016U.S. Department of Transportation (USDOT)F...",office mayormary casillas salasfebruary department transportation usdotfederal highway administr...
4,"Birmingham,AL",BirminghamRisingBirmingham Rising! Meeting the Challenge to Become America™s Next Smart City \ni...,rising meeting become america next icontentsintroduction opportunity vision goal approach capa c...
5,"Fresno,CA",U.S. Department of Transportation \nNotice of Funding Opportunity Number DTFH6116RA00002 \nBeyon...,department transportation notice funding opportunity number section part one vision narrativeann...
6,"Long Beach,CA","Grant Proposal USDOT SMART CITIESFebruary 4, 2016CONTENTS\nPART I- VISION NARRATIVE\nLetters of ...",grant citiesfebruary part vision narrative letter support iii long beach vision usdots populatio...
7,"Newark,NJ",TABLE OF CONTENTS\n \nExecutive Summary\n \n................................\n.....................,table executive summary vision motivation vision goal objective oal improve intermodal mobility ...
8,"Oklahoma City,OK",U.S. Department of Transportation\nNotice of Funding Opportunity Number DTFH6116RA00002\nﬁBeyond...,department transportation notice funding opportunity number trafc byoklahoma oklahoma vision okl...
9,"Nashville,TN",!!!!!!USDOT: Smart City Challenge\n ÒNashville Connected: Music CityÕs Smart Transportation Visi...,connected music transportation vision metropolitan government davidson county thursday metropoli...


### Clean Up: Discussion
Answer the questions below.

#### Which Smart City applicants did you remove? What issues did you see with the documents?

I removed Brookhaven, Toledo, Moreno Valley, Lubbock, Reno, and Tallahassee because PyPDF2 raised errors while processing them. The issue was caused by the structure of these PDFs. No text was extracted for any of them, so their raw text entries were blank, which is why I removed them from the dataframe.


#### Explain what additional text processing methods you used and why.

1. Removal of stop words: I removed stop words because they are common in the language and contribute little to the meaning of the text, which makes them unhelpful for clustering. Removing them also reduces the size of the vocabulary and the computational resources required for analysis.
2. Removal of special characters: I removed punctuation marks, symbols, and digits from the text. These characters do not contribute to the meaning of the text and can interfere with the analysis.
3. Lowercasing: I converted all words to lowercase to keep the vocabulary consistent and reduce the complexity of the analysis.

These methods were used to improve the quality of the text data, reduce its dimensionality, and improve the accuracy of the analysis.

Normalization and contraction expansion are additional text processing methods. Normalization transforms text into a standard format by removing irrelevant information such as stop words and punctuation and by converting text to lowercase. Contraction expansion is a specific type of normalization that converts words like "can't" to "cannot" and "it's" to "it is". These techniques simplify the text and make it more consistent, which can improve the performance of machine learning models.
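
To illustrate contraction expansion, here is a minimal sketch with a four-entry map. The map, the pattern, and the helper `expand_contractions` are assumptions for this example; the linked `contractions.py` is expected to provide a fuller map.

```python
import re

# A tiny illustrative map, not the textbook's full list.
CONTRACTIONS = {
    "can't": "cannot",
    "it's": "it is",
    "won't": "will not",
    "don't": "do not",
}

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace known contractions with their lowercase expanded form."""
    return _PATTERN.sub(lambda match: CONTRACTIONS[match.group(0).lower()], text)


expanded = expand_contractions("It's clear the city can't wait.")
```

In a pipeline that strips non-alphanumeric characters, a step like this would need to run first, because the apostrophes it relies on are deleted by that regex.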

#### Did you identify any potentially problematic words?

Words such as "page", "city", and "smart" seem important, but they appear in every application as part of its boilerplate, so they can be problematic for clustering. Words such as "challenge", "contents", "transportation", and "funding" are similarly generic and not useful for clustering. Other problematic categories are:

1. Stop words and other highly frequent words: "the", "and", "in", "of", "to", "a", "is", etc.
2. Rare words: "pneumonoultramicroscopicsilicovolcanoconiosis", "hippopotomonstrosesquipedalian", etc.
3. Domain-specific jargon: "cytokinesis", etc.
4. Numerical values or special characters: "20%", "$100".

## Experimenting with Clustering Models (Required)

Now, you'll start to explore models to find the optimal clustering model. In this section, you'll explore [K-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [Hierarchical](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html), and [DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN) clustering algorithms.
Create these algorithms with k_clusters for K-means and Hierarchical.
For each cell in the table provide the [Silhouette score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score), [Calinski and Harabasz score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html#sklearn.metrics.calinski_harabasz_score), and [Davies-Bouldin score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.davies_bouldin_score.html#sklearn.metrics.davies_bouldin_score).
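
To make the Silhouette score more concrete, the sketch below computes it from scratch for a handful of 2-D points. It is an illustrative O(n²) version with Euclidean distance, not a replacement for `sklearn.metrics.silhouette_score`; the function name `silhouette` and the sample points are assumptions.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient with Euclidean distance (illustrative only)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i, members):
        # Mean distance from point i to the members, skipping i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within the own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom else 0.0
    return total / len(points)


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # tight, well-separated clusters
bad = silhouette(points, [0, 1, 0, 1])   # clusters that mix distant points
```

Values near 1 indicate compact, well-separated clusters, values near 0 indicate overlap, and negative values suggest points assigned to the wrong cluster.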

In each cell, create an array to store the values.
For example, 

|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means| [S,CH,DB]| [S,CH,DB] | [S,CH,DB] | [S,CH,DB] |
|Hierarchical |[S,CH,DB]| [S,CH,DB]| [S,CH,DB] | [S,CH,DB]|
|DBSCAN | X | X | X | [S,CH,DB] |



### Optimality 
You will need to find the optimal k for K-means and Hierarchical algorithms.
Find the optimality for k in the range 2 to 50.
Provide the code used to generate the optimal k and provide justification for your approach.
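
One simple way to turn per-k scores into a single choice is to keep one score per k and pick the k with the best value for a chosen metric. The sketch below shows only that selection step. The helper `pick_best_k` and the score dictionaries are hypothetical, and the numbers are not results from this notebook.

```python
def pick_best_k(scores_by_k, higher_is_better=True):
    """Return the k whose score is best; ties go to the smallest k."""
    chooser = max if higher_is_better else min
    return chooser(sorted(scores_by_k), key=scores_by_k.get)


# Hypothetical scores for illustration only.
silhouette_by_k = {2: 0.08, 3: 0.11, 4: 0.05, 5: 0.03}
davies_bouldin_by_k = {2: 1.40, 3: 0.78, 4: 0.95, 5: 1.10}

# Silhouette and Calinski-Harabasz: higher is better. Davies-Bouldin: lower is better.
best_by_silhouette = pick_best_k(silhouette_by_k)
best_by_db = pick_best_k(davies_bouldin_by_k, higher_is_better=False)
```

Selecting on one metric at a time avoids averaging scores that live on different scales and point in different directions.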


|Algorithm| k = 9 | k = 18| k = 36 | Optimal k| 
|--|--|--|--|--|
|K-means|--|--|--|--|
|Hierarchical |--|--|--|--|
|DBSCAN | X | X | X | -- |



In [15]:
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.metrics import silhouette_score, calinski_harabasz_score, davies_bouldin_score
import numpy as np

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
clean_text = df['clean text'].tolist()
tfidf = TfidfVectorizer(lowercase = False)
X = tfidf.fit_transform(clean_text)

In [17]:
def find_optimal_k_kmeans(data):
    silhouette_scores = []
    calinski_harabasz_scores = []
    davies_bouldin_scores = []
    for k in range(2, 50):
        kmeans = KMeans(n_clusters =k, n_init = 'auto', random_state = 42)
        labels = kmeans.fit_predict(data)
       
        if len(set(labels))>1:
            silhouette = silhouette_score(data, labels)
            calinski_harabasz = calinski_harabasz_score(data, labels)

            davies_bouldin = davies_bouldin_score(data, labels)
        else:
            silhouette = -1
            calinski_harabasz = -1
            davies_bouldin = -1
        silhouette_scores.append(silhouette)
        calinski_harabasz_scores.append(calinski_harabasz)
        davies_bouldin_scores.append(davies_bouldin)
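    # NOTE: avg_scores has three entries (one mean per metric), so np.argmax
    # returns the index of the metric with the largest mean, not the position of a k.
    avg_scores = [np.mean(scores) for scores in [silhouette_scores, calinski_harabasz_scores, davies_bouldin_scores]]
    best_k = np.argmax(avg_scores) + 2
    return best_k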


In [18]:
# Function to find optimal_k for Hierarchical
def find_optimal_k_Hierarchical(data):
    silhouette_scores = []
    calinski_harabasz_scores = []
    davies_bouldin_scores = []
    for k in range(2, 50):
        Hierarchical = AgglomerativeClustering(n_clusters=k)

        # Get the labels from the AgglomerativeClustering model
        labels = Hierarchical.fit_predict(data)

      
        
        if len(set(labels))>1:
            silhouette = silhouette_score(data, labels)
            calinski_harabasz = calinski_harabasz_score(data, labels)
            davies_bouldin = davies_bouldin_score(data, labels)
        else:
            silhouette = -1
            calinski_harabasz = -1
            davies_bouldin = -1
        silhouette_scores.append(silhouette)
        calinski_harabasz_scores.append(calinski_harabasz)
        davies_bouldin_scores.append(davies_bouldin)
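    # NOTE: avg_scores has three entries (one mean per metric), so np.argmax
    # returns the index of the metric with the largest mean, not the position of a k.
    avg_scores = [np.mean(scores) for scores in [silhouette_scores, calinski_harabasz_scores, davies_bouldin_scores]]
    best_k = np.argmax(avg_scores) + 2
    return best_k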

            

In [19]:
def find_optimal_k_dbscan(data):
    
    silhouette_scores= []
    calinski_harabasz_scores= []
    davies_bouldin_scores= []
    for eps in np.arange(0.1, 1.0, 0.1):
        for min_samples in range(2, 10):
            dbscan = DBSCAN(eps=eps, min_samples=min_samples)
            labels = dbscan.fit_predict(data)
           

            if len(set(labels)) > 1:
                
                silhouette = silhouette_score(data, labels)
                calinski_harabasz = calinski_harabasz_score(data, labels)
                davies_bouldin = davies_bouldin_score(data, labels)
         

            else:
                silhouette = -1
                calinski_harabasz = -1
                davies_bouldin = -1

            silhouette_scores.append(silhouette)
            calinski_harabasz_scores.append(calinski_harabasz)
            davies_bouldin_scores.append(davies_bouldin)
           

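    # NOTE: avg_scores has three entries (one mean per metric), so the value returned
    # here is a metric index plus 2, not a cluster count or an (eps, min_samples) setting.
    avg_scores = [np.mean(scores) for scores in [silhouette_scores, calinski_harabasz_scores, davies_bouldin_scores]]
    best_k = np.argmax(avg_scores) + 2
    return best_k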
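
scikit-learn's DBSCAN assigns the label `-1` to noise points, so `len(set(labels))` counts noise as if it were a cluster. A small sketch of counting clusters without noise follows. The helper names and label vectors are hypothetical.

```python
def count_clusters(labels):
    """Number of clusters in a DBSCAN labelling, ignoring the noise label -1."""
    return len(set(labels) - {-1})


def noise_fraction(labels):
    """Share of samples that DBSCAN marked as noise."""
    return sum(1 for label in labels if label == -1) / len(labels)


# Hypothetical label vectors for illustration.
one_cluster_plus_noise = [-1, -1, 0, 0, -1]
two_clusters_plus_noise = [0, 0, 1, 1, -1]

# The first labelling has two distinct labels but only one real cluster.
n_first = count_clusters(one_cluster_plus_noise)
n_second = count_clusters(two_clusters_plus_noise)
```

Requiring at least two real clusters before scoring, and reporting the noise fraction next to the scores, makes DBSCAN results easier to compare with the other algorithms.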


In [21]:
best_k_kmeans = find_optimal_k_kmeans(X.toarray())
best_k_Hierarchical = find_optimal_k_Hierarchical(X.toarray())
best_k_dbscan = find_optimal_k_dbscan(X.toarray())
print(best_k_kmeans, "", best_k_Hierarchical, "", best_k_dbscan )


3  3  4


In [22]:
k_values = [[9,18,36, best_k_kmeans],
           [9,18,36, best_k_Hierarchical],
            [9,18,36, best_k_dbscan],
            [0,0,0,4]]

In [25]:
def evaluate_algorithm(algorithm, k, data):
    # Fit the model and score its labels with the three internal metrics
    labels = algorithm.fit_predict(data)
    silhouette = silhouette_score(data, labels)
    calinski_harabasz = calinski_harabasz_score(data, labels)
    davies_bouldin = davies_bouldin_score(data, labels)
    return (k, round(silhouette, 2), round(calinski_harabasz, 2), round(davies_bouldin, 2))

In [26]:
kmeans_eval = []
for k in k_values[0]:
    kmeans = KMeans(n_clusters=k, n_init='auto', random_state=42)
    # 'scores' avoids shadowing the built-in eval()
    scores = evaluate_algorithm(kmeans, k, X.toarray())
    kmeans_eval.append(list(scores[1:]))
print(kmeans_eval)

[[0.03, 1.51, 1.05], [-0.01, 1.43, 1.03], [-0.0, 1.66, 0.82], [0.11, 1.59, 0.78]]


In [27]:
Hierarchical_eval = []
# Use the Hierarchical row of k_values (the original indexed the K-means row)
for k in k_values[1]:
    Hierarchical = AgglomerativeClustering(n_clusters=k)
    scores = evaluate_algorithm(Hierarchical, k, X.toarray())
    Hierarchical_eval.append(list(scores[1:]))
print(Hierarchical_eval)

[[0.02, 1.84, 2.55], [0.03, 1.75, 1.55], [0.02, 1.76, 1.01], [0.01, 2.61, 4.64]]


In [2]:
# Define a function to test the clustering algorithms with different k values
def test_clustering(algorithm, k_values, tfidf_matrix):
    # Dense copy: AgglomerativeClustering and the CH/DB scores need dense input
    dense = tfidf_matrix.toarray()
    results = []
    for k in k_values:
        if algorithm == DBSCAN:
            # DBSCAN takes no cluster count, so k is ignored here
            labels = algorithm(eps=1.0, min_samples=5).fit(dense).labels_
        else:
            labels = algorithm(n_clusters=k).fit(dense).labels_
        if len(set(labels)) < 2:
            # The scores are undefined when only one label is present
            results.append(None)
            continue
        # Supervised metrics (homogeneity, completeness, V-measure) are omitted:
        # the dataframe has no 'true_labels' column to compare against
        results.append([
            silhouette_score(dense, labels),
            calinski_harabasz_score(dense, labels),
            davies_bouldin_score(dense, labels),
        ])

    return results





In [33]:
scores_data = [
    ['kmeans', kmeans_eval[0],kmeans_eval[1], kmeans_eval[2], kmeans_eval[3]],
    ['Hierarchical', Hierarchical_eval[0], Hierarchical_eval[1], Hierarchical_eval[2], Hierarchical_eval[3]],
    ['DBSCAN', 'X', 'X', 'X', 'X']
]

In [34]:
#CREATE DF
cluster_df = pd.DataFrame(scores_data, columns= ['Algorithm', 'k=9', 'k=18', 'k=36', 'Optimal k'])
cluster_df.set_index('Algorithm', inplace=True)

In [35]:
display(cluster_df)

| Algorithm | k=9 | k=18 | k=36 | Optimal k |
|--|--|--|--|--|
| kmeans | [0.03, 1.51, 1.05] | [-0.01, 1.43, 1.03] | [-0.0, 1.66, 0.82] | [0.11, 1.59, 0.78] |
| Hierarchical | [0.02, 1.84, 2.55] | [0.03, 1.75, 1.55] | [0.02, 1.76, 1.01] | [0.01, 2.61, 4.64] |
| DBSCAN | X | X | X | X |


#### How did you approach finding the optimal k?

I calculated the Silhouette, CH, and DB scores for every k from 2 to 49 for K-means and Hierarchical clustering, and over a grid of `eps` and `min_samples` values for DBSCAN. For each algorithm I then took the value returned by `np.argmax` over the averaged scores, plus 2, as the best k. The values I obtained were 3, 3, and 4.

#### What algorithm do you believe is the best? Why?

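I chose Hierarchical clustering with 3 clusters as the best model. Its scores of [Silhouette, CH, DB] = [0.01, 2.61, 4.64] include the highest CH score in the table. Note that lower DB values indicate better separation, so this choice rests mainly on the CH score. This is the model I save.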

### Add Cluster ID to output file
In your data structure, add the cluster id for each smart city respectively. Show the code to append the clusterid below.

In [37]:
#USE Hierarchical MODEL WITH 3 CLUSTER
Hac_model = AgglomerativeClustering(n_clusters=3)
cluster_ids= Hac_model.fit_predict(X.toarray())

In [38]:
df['cluster id'] = cluster_ids
display(df)

Unnamed: 0,city,raw text,clean text,cluster id
0,"Port Huron and Marysville,MI",\n1\n \n \n1)\n \nVision for the Port Huron/Marysville Smart Cities Collaborative\n \n \nIntrod...,vision port huronmarysville city collaborative introduction city port huron marysville michigan ...,0
1,"Las Vegas,NV",Part 1 Œ Vision NarrativeAttachments1 Smart City Vision ...........................................,part vision narrativeattachments vision la vega aligns population characteristic la vega aligns ...,0
2,"Seattle,WA",Beyond Traf˜c: USDOT Smart City Challenge\nApplication prepared by Seattle Department of Transpo...,trafc prepared department transportation partnership seattlea prototype new century digital tabl...,0
3,"Chula Vista,CA","OFFICE OF THE MAYORMary Casillas SalasFebruary 2, 2016U.S. Department of Transportation (USDOT)F...",office mayormary casillas salasfebruary department transportation usdotfederal highway administr...,1
4,"Birmingham,AL",BirminghamRisingBirmingham Rising! Meeting the Challenge to Become America™s Next Smart City \ni...,rising meeting become america next icontentsintroduction opportunity vision goal approach capa c...,1
5,"Fresno,CA",U.S. Department of Transportation \nNotice of Funding Opportunity Number DTFH6116RA00002 \nBeyon...,department transportation notice funding opportunity number section part one vision narrativeann...,2
6,"Long Beach,CA","Grant Proposal USDOT SMART CITIESFebruary 4, 2016CONTENTS\nPART I- VISION NARRATIVE\nLetters of ...",grant citiesfebruary part vision narrative letter support iii long beach vision usdots populatio...,0
7,"Newark,NJ",TABLE OF CONTENTS\n \nExecutive Summary\n \n................................\n.....................,table executive summary vision motivation vision goal objective oal improve intermodal mobility ...,0
8,"Oklahoma City,OK",U.S. Department of Transportation\nNotice of Funding Opportunity Number DTFH6116RA00002\nﬁBeyond...,department transportation notice funding opportunity number trafc byoklahoma oklahoma vision okl...,0
9,"Nashville,TN",!!!!!!USDOT: Smart City Challenge\n ÒNashville Connected: Music CityÕs Smart Transportation Visi...,connected music transportation vision metropolitan government davidson county thursday metropoli...,2


### Save Model

After finding the best model, it is desirable to have a way to persist the model for future use without having to retrain. Save the model using [model persistence](https://scikit-learn.org/stable/model_persistence.html). This model should be saved in the same directory as this notebook and should be loaded as the model for your `project3.py`.

Save the model as `model.pkl`. You do not have to use pickle, but be sure to save the model using one of the persistence methods listed in the link.
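
As a minimal sketch of the save-and-load round trip with `pickle`, the example below uses a plain dictionary as a stand-in for a fitted estimator and a temporary directory instead of the notebook folder. Both are assumptions for illustration, as are the helpers `save_model` and `load_model`.

```python
import os
import pickle
import tempfile


def save_model(model, path):
    """Serialize the model to disk, overwriting any existing file."""
    with open(path, "wb") as handle:
        pickle.dump(model, handle)


def load_model(path):
    """Load a previously pickled model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Stand-in for a fitted estimator: any picklable object round-trips the same way.
stand_in_model = {"algorithm": "AgglomerativeClustering", "n_clusters": 3}

# Hypothetical location; the project expects model.pkl next to the notebook.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")
save_model(stand_in_model, model_path)
restored = load_model(model_path)
```

Overwriting on every save keeps the file in step with the most recently fitted model. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.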

In [45]:
import os
import pickle

# Save the fitted model only if model.pkl does not exist yet
if not os.path.isfile('model.pkl'):
    with open('model.pkl', 'wb') as f:
        pickle.dump(Hac_model, f)

# Load the persisted model back
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

## Deriving Themes and Concepts (Required)

Perform Topic Modeling on the cleaned data. Provide the top five words for `TOPIC_NUM = Best_k` as defined in the section above. Feel free to reference [Chapter 6](https://github.com/dipanjanS/text-analytics-with-python/tree/master/New-Second-Edition/Ch06%20-%20Text%20Summarization%20and%20Topic%20Models) for more information on Topic Modeling and Summarization.

In [54]:
!pip install gensim



In [55]:
!pipenv install gensim

Installing gensim...
Resolving gensim...
Installing...
Adding gensim to Pipfile's [packages] ...
✔ Installation Succeeded
Pipfile.lock (414aa2) out of date, updating to (885e77)...
Locking [packages] dependencies...
Building requirements...
Resolving dependencies...
✘ Locking Failed!
CRITICAL:pipenv.patched.pip._internal.resolution.resolvelib.factory:Could not find a version that satisfies the requirement cloud-init==20.4.1 (from versions: none)
[ResolutionFailure]:   File "/home/habibmehreen1/.local/lib/python3.9/site-packages/pipenv/resolver.py", line 811, in _main
[ResolutionFailure]:       resolve_packages(
[ResolutionFailure]:   File "/home/habibmehreen1/.local/lib/python3.9/site-packages/pipenv/resolver.py", line 759, in resolve_packages
[Re

In [84]:
!python -m pip install gensim==3.8.3

Collecting gensim==3.8.3
  Downloading gensim-3.8.3.tar.gz (23.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 23.4/23.4 MB 22.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: gensim
  Building wheel for gensim (setup.py) ... done
  Created wheel for gensim: filename=gensim-3.8.3-cp39-cp39-linux_x86_64.whl size=26166085 sha256=308ab9e9ff4ab3c6bfdab1d283319800699aa80495446ee381cdc73bd0554a10
  Stored in directory: /home/habibmehreen1/.cache/pip/wheels/ca/5d/af/618594ec2f28608c1d6ee7d2b7e95a3e9b06551e3b80a491d6
Successfully built gensim
Installing collected packages: gensim
  Attempting uninstall: gensim
    Found existing installation: gensim 4.1.2
    Uninstalling gensim-4.1.2:
      Successfully uninstalled gensim-4.1.2
Successfully installed gensim-3.8.3


In [90]:
import gensim
from gensim import corpora
num_topics = 3
cleaned_text = df['clean text']
tokenized_text = [word_tokenize(text.lower()) for text in cleaned_text]
dictionary = corpora.Dictionary(tokenized_text)

In [91]:
#convert dictionary into bag of words
corpus = [dictionary.doc2bow(text) for text in tokenized_text]
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,random_state=42, passes =10,
                                           per_word_topics=True)

In [93]:
for TOPIC_NUM in range(num_topics):
    top_words = lda_model.show_topic(TOPIC_NUM, topn=5)
    print(top_words)

[('data', 0.01156038), ('transportation', 0.010092945), ('system', 0.009188557), ('vehicle', 0.008632084), ('technology', 0.006248969)]
[('data', 0.011771935), ('system', 0.009447241), ('transportation', 0.008987021), ('vehicle', 0.008789045), ('transit', 0.00679922)]
[('data', 0.011968903), ('transportation', 0.009773083), ('vehicle', 0.009201948), ('system', 0.008765155), ('new', 0.0068140496)]


### Extract themes
Write a theme for each topic (at least a sentence each).

Topic 1: data, transportation, system, vehicle, technology
A potential theme for this topic could be "Emerging Technologies in Transportation". This theme encompasses the concepts of data analysis, transportation systems, and vehicles, which are all relevant to the development and implementation of new technologies in transportation. The mention of "system" also suggests a focus on overall transportation infrastructure and operations.

Topic 2: data, system, transportation, vehicle, transit
A potential theme for this topic could be "Intelligent Transportation Systems". The words "data", "system", "transportation", and "vehicle" all relate to the field of intelligent transportation systems, which integrates advanced technologies to improve the safety, efficiency, and sustainability of transportation. The presence of "transit" also supports this theme, as transit systems are a key component of intelligent transportation systems.

Topic 3: data, transportation, vehicle, system, new
A potential theme for this topic could be "Innovative Transportation Solutions." This theme reflects the role of data in developing new and more efficient transportation systems, along with the focus on vehicles and the need for new solutions in the transportation industry.

### Add Topic ID to output file
Add the top two topics for each smart city to the data structure.

In [97]:
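topic_ids = []
for doc in corpus:
    # (topic_id, probability) pairs; topics below the model's minimum_probability are omitted
    topic_dist = lda_model.get_document_topics(doc)
    # Rank topics from most to least probable and keep the ids of the top two
    sorted_topics = sorted(topic_dist, key=lambda x: x[1], reverse=True)
    top_topics = [topic[0] for topic in sorted_topics[:2]]
    topic_ids.append(top_topics)
print(topic_ids)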

[[0], [1], [1, 2], [0], [0], [1, 2], [1], [1, 2], [1], [0, 2], [1], [2], [0, 1], [1], [2], [0], [2], [2, 0], [0], [0], [1], [0, 1], [0], [1, 2], [2], [2], [0, 1], [2, 1], [2, 1], [0], [0], [1], [0], [1], [1], [0], [2], [1], [2], [1], [1, 0], [2], [0], [2], [2], [2], [0], [2], [2, 0], [2], [2, 1], [0], [1], [2, 1], [0, 2], [0], [1], [1], [0], [1], [0, 1]]
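
Several entries in this output hold a single topic ID even though two were requested. A likely explanation is that `get_document_topics` discards topics whose probability falls below the model's `minimum_probability` threshold, so a document dominated by one topic comes back with only that topic. The sketch below illustrates the effect in plain Python; the helper `top_n_topics` and the probabilities are made up for illustration and are not taken from the fitted model.

```python
def top_n_topics(topic_dist, n=2):
    """Return the ids of the n most probable topics from (topic_id, prob) pairs."""
    ranked = sorted(topic_dist, key=lambda pair: pair[1], reverse=True)
    return [topic_id for topic_id, _ in ranked[:n]]


# Hypothetical full distribution for one document over three topics.
full_dist = [(0, 0.004), (1, 0.990), (2, 0.006)]

# Simulate a 0.01 probability threshold: low-probability topics are dropped.
filtered_dist = [(topic_id, prob) for topic_id, prob in full_dist if prob >= 0.01]

print(top_n_topics(filtered_dist))  # only one topic survives the threshold
print(top_n_topics(full_dist))      # two topics when nothing is filtered out
```

If every city should always receive two topic IDs, one option is to call `lda_model.get_document_topics(doc, minimum_probability=0)` so that nothing is filtered before ranking.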


In [98]:
df['topic id'] = topic_ids
display(df)

Unnamed: 0,city,raw text,clean text,cluster id,topic id
0,"Port Huron and Marysville,MI",\n1\n \n \n1)\n \nVision for the Port Huron/Marysville Smart Cities Collaborative\n \n \nIntrod...,vision port huronmarysville city collaborative introduction city port huron marysville michigan ...,0,[0]
1,"Las Vegas,NV",Part 1 Œ Vision NarrativeAttachments1 Smart City Vision ...........................................,part vision narrativeattachments vision la vega aligns population characteristic la vega aligns ...,0,[1]
2,"Seattle,WA",Beyond Traf˜c: USDOT Smart City Challenge\nApplication prepared by Seattle Department of Transpo...,trafc prepared department transportation partnership seattlea prototype new century digital tabl...,0,"[1, 2]"
3,"Chula Vista,CA","OFFICE OF THE MAYORMary Casillas SalasFebruary 2, 2016U.S. Department of Transportation (USDOT)F...",office mayormary casillas salasfebruary department transportation usdotfederal highway administr...,1,[0]
4,"Birmingham,AL",BirminghamRisingBirmingham Rising! Meeting the Challenge to Become America™s Next Smart City \ni...,rising meeting become america next icontentsintroduction opportunity vision goal approach capa c...,1,[0]
5,"Fresno,CA",U.S. Department of Transportation \nNotice of Funding Opportunity Number DTFH6116RA00002 \nBeyon...,department transportation notice funding opportunity number section part one vision narrativeann...,2,"[1, 2]"
6,"Long Beach,CA","Grant Proposal USDOT SMART CITIESFebruary 4, 2016CONTENTS\nPART I- VISION NARRATIVE\nLetters of ...",grant citiesfebruary part vision narrative letter support iii long beach vision usdots populatio...,0,[1]
7,"Newark,NJ",TABLE OF CONTENTS\n \nExecutive Summary\n \n................................\n.....................,table executive summary vision motivation vision goal objective oal improve intermodal mobility ...,0,"[1, 2]"
8,"Oklahoma City,OK",U.S. Department of Transportation\nNotice of Funding Opportunity Number DTFH6116RA00002\nﬁBeyond...,department transportation notice funding opportunity number trafc byoklahoma oklahoma vision okl...,0,[1]
9,"Nashville,TN",!!!!!!USDOT: Smart City Challenge\n ÒNashville Connected: Music CityÕs Smart Transportation Visi...,connected music transportation vision metropolitan government davidson county thursday metropoli...,2,"[0, 2]"


## Gathering Applicant Summaries and Keywords (Extra Credit Section)

For each smart city applicant, gather a summary and keywords that are important to that document. Gensim's summarization module is outdated (it was removed in Gensim 4.0); try a spaCy or NLTK method.
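
As a starting point, the idea behind frequency-based extractive summarization can be sketched with the standard library alone. This is not a spaCy or NLTK method; it is a simplified stand-in that assumes a tiny hand-written stop-word list and naive sentence splitting, and the function name `keywords_and_summary` is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "will", "it"}


def keywords_and_summary(text, num_keywords=5, num_sentences=1):
    """Return (keywords, summary) using simple word-frequency scoring."""
    # Naive sentence split on ., ! or ? followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)
    keywords = [word for word, _ in freq.most_common(num_keywords)]

    def score(sentence):
        # Sum of word frequencies; stop words are absent from freq and count as 0
        return sum(freq[token] for token in re.findall(r"[a-z]+", sentence.lower()))

    top = sorted(sentences, key=score, reverse=True)[:num_sentences]
    # Emit the selected sentences in their original document order
    summary = " ".join(s for s in sentences if s in top)
    return keywords, summary


sample = ("The city will deploy connected vehicle technology. "
          "Transit data will guide the connected vehicle rollout. "
          "Parks are nice.")
keywords, summary = keywords_and_summary(sample, num_keywords=2)
print(keywords)
print(summary)
```

Applied to the DataFrame, a function like this could be mapped over the `raw text` column, since the cleaned text no longer has sentence boundaries to split on.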



In [None]:
# Not mandatory

In [None]:
# Not mandatory

### Add Summaries and Keywords
Add summary and keywords to output file.

In [None]:
# Not mandatory

## Write output data (Required)

The output data should be written as a TSV file.
You can use the `to_csv` method from pandas for this if you are using a DataFrame.

`Syntax: df.to_csv('file.tsv', sep='\t')` \
`Example: df.to_csv('smartcity_eda.tsv', sep='\t')`

In [99]:
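# index=False keeps the pandas row index out of the file, matching the expected output columns
df.to_csv('smartcity_eda.tsv', sep='\t', escapechar='\\', index=False)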
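
The sample output table at the top of this notebook shows topic IDs as `T1, T2`, whereas the `topic id` column here holds Python lists such as `[1, 2]`, which `to_csv` writes out as their string form. A small conversion step could bring the two in line. The sketch below assumes that label `T1` corresponds to topic ID 0 (matching the "Topic 1" naming used in the themes); the helper name `format_topic_ids` is hypothetical.

```python
def format_topic_ids(topic_ids):
    """Turn zero-based topic ids such as [0, 2] into a label string such as 'T1, T3'."""
    # Assumption: labels are one-based, so topic id 0 becomes T1
    return ", ".join(f"T{topic_id + 1}" for topic_id in topic_ids)


print(format_topic_ids([0, 2]))
print(format_topic_ids([1]))
```

With a DataFrame, this could be applied before writing the file, for example `df['topic id'] = df['topic id'].apply(format_topic_ids)`.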

# Moving Forward
Now that you have explored the dataset, take the important features and functions to create your `project3.py`.
Please refer to the project spec for more guidance.
