Data Sources:
* NYC Open Data API for restaurants data: https://dev.socrata.com/foundry/data.cityofnewyork.us/43nn-pn8j
* NewsAPI for articles (JSON): https://newsapi.org/

Technologies:
1. **API requests** to NewsAPI for food-related articles and to the NYC Open Data API for restaurants data
2. Store articles data and restaurants data in **MongoDB**
3. **Flask** to take a cuisine and zipcode from the user and return matching restaurants and articles

Cannot be run on Google Colab because MongoDB and Flask both run on localhost.

# NYC Restaurants data

Import data from the Socrata Open Data API. Allow about a minute for it to load.

In [1]:
#pip install sodapy

In [2]:
# Ignore warning
from sodapy import Socrata
import pandas as pd

# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofnewyork.us", None)

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("43nn-pn8j", limit=300000)

# Convert to pandas DataFrame
restaurants = pd.DataFrame.from_records(results)
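If one request for 300,000 rows is slow or times out, the same data could be pulled in pages instead. The sketch below is one possible way to do that, assuming the endpoint honors `limit` and `offset` together. `fetch_all`, `fetch_page`, and `page_size` are names made up for this example and are not part of sodapy. A stand-in data source is used so the helper can be tried offline.

```python
def fetch_all(fetch_page, page_size=50000):
    """Collect rows page by page until a short or empty page comes back.

    fetch_page is any callable taking (limit, offset) and returning a list
    of rows. The helper and its parameter names are hypothetical.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        rows.extend(page)
        if len(page) < page_size:  # a short page means nothing is left
            break
        offset += page_size
    return rows


# Intended use with the Socrata client above (not executed here):
# results = fetch_all(
#     lambda limit, offset: client.get("43nn-pn8j", limit=limit, offset=offset)
# )

# Stand-in data source so the helper can be tried without network access
fake_rows = [{"camis": str(i)} for i in range(7)]
pages = fetch_all(
    lambda limit, offset: fake_rows[offset:offset + limit], page_size=3
)
print(len(pages))
```

Paging only returns consistent results if the row order is stable between requests, so a real run would probably also need an explicit ordering on the query.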



In [3]:
# Number of rows
len(restaurants)

224563

In [4]:
restaurants.head()

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,...,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta
0,50105039,CHINA CHEF CHEN,Brooklyn,1073,FLATBUSH AVENUE,11226,7182878173,Chinese,2023-01-09T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Re-inspection,40.644454931217,-73.95795139689,314,40,79200,3118908,3051650087,BK95
1,50111234,AIRPORT BAGELS & DELI,Queens,8420,ASTORIA BLVD,11370,6468069605,American,2022-02-08T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Pre-permit (Operational) / Initial Inspection,40.764612087724,-73.884845497109,403,22,32900,4024027,4010960022,QN28
2,50109894,Tokyo Sushi,Manhattan,151,RIVINGTON STREET,10002,6468820152,Japanese,2023-02-14T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Re-inspection,40.719240393784,-73.985645741014,103,1,1402,1004172,1003480015,MN28
3,50032701,KUU,Manhattan,20,JOHN STREET,10038,2125717177,Japanese,2022-03-29T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Initial Inspection,40.709837370946,-74.008876810465,101,1,1502,1001104,1000650022,MN25
4,41232125,VELVET LOUNGE,Brooklyn,174,BROADWAY,11211,7183024427,American,2022-02-16T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Initial Inspection,40.710010026465,-73.962602595341,301,34,54900,3059532,3021320020,BK73


### Clean data

* Columns dba (restaurant name), cuisine_description, and zipcode contain missing values; drop those rows
* Lowercase cuisine_description for easier search
* Restaurants appear in multiple records; drop duplicates, keeping the first record per name

In [5]:
# Take an explicit copy so the assignments below modify df itself,
# not a view of restaurants (avoids SettingWithCopyWarning)
columns_to_check = ['dba', 'cuisine_description', 'zipcode']
df = restaurants.dropna(subset=columns_to_check).copy()

df['zipcode'] = df['zipcode'].astype(int).astype(str)
df['cuisine_description'] = df['cuisine_description'].str.lower()

# Keep only the first occurrence of each restaurant name
restaurants_cleaned = df.drop_duplicates(subset=['dba'], keep='first')


In [6]:
# Number of restaurants
len(restaurants_cleaned)

20769
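Deduplicating on `dba` keeps one row per restaurant name, so separate locations that share a name collapse into a single row. Whether that is intended is not stated here. The plain-Python sketch below mirrors the three cleaning steps on a few made-up rows and compares deduplicating by name with deduplicating by `camis`, which is assumed here to identify one establishment. The rows, `REQUIRED`, and `clean` are all hypothetical.

```python
# Hypothetical rows: one exact repeat, one second location with the same
# name, and one row with a missing name
records = [
    {"camis": "1", "dba": "CORNER CAFE", "cuisine_description": "American", "zipcode": "10002"},
    {"camis": "1", "dba": "CORNER CAFE", "cuisine_description": "American", "zipcode": "10002"},
    {"camis": "2", "dba": "CORNER CAFE", "cuisine_description": "American", "zipcode": "11211"},
    {"camis": "3", "dba": None, "cuisine_description": "Japanese", "zipcode": "10038"},
]

REQUIRED = ("dba", "cuisine_description", "zipcode")


def clean(rows, key):
    """Drop rows missing a required field, lowercase the cuisine,
    and keep only the first row seen for each value of `key`."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(row.get(col) is None for col in REQUIRED):
            continue
        if row[key] in seen:
            continue
        seen.add(row[key])
        cleaned.append(
            {**row, "cuisine_description": row["cuisine_description"].lower()}
        )
    return cleaned


by_name = clean(records, key="dba")
by_id = clean(records, key="camis")
print(len(by_name), len(by_id))  # prints: 1 2
```

With these rows, deduplicating by name keeps one location while deduplicating by id keeps both, which matters for a search filtered by zipcode.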

In [7]:
restaurants_cleaned.head()

Unnamed: 0,camis,dba,boro,building,street,zipcode,phone,cuisine_description,inspection_date,action,...,record_date,inspection_type,latitude,longitude,community_board,council_district,census_tract,bin,bbl,nta
0,50105039,CHINA CHEF CHEN,Brooklyn,1073,FLATBUSH AVENUE,11226,7182878173,chinese,2023-01-09T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Re-inspection,40.644454931217,-73.95795139689,314,40,79200,3118908,3051650087,BK95
1,50111234,AIRPORT BAGELS & DELI,Queens,8420,ASTORIA BLVD,11370,6468069605,american,2022-02-08T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Pre-permit (Operational) / Initial Inspection,40.764612087724,-73.884845497109,403,22,32900,4024027,4010960022,QN28
2,50109894,Tokyo Sushi,Manhattan,151,RIVINGTON STREET,10002,6468820152,japanese,2023-02-14T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Re-inspection,40.719240393784,-73.985645741014,103,1,1402,1004172,1003480015,MN28
3,50032701,KUU,Manhattan,20,JOHN STREET,10038,2125717177,japanese,2022-03-29T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Initial Inspection,40.709837370946,-74.008876810465,101,1,1502,1001104,1000650022,MN25
4,41232125,VELVET LOUNGE,Brooklyn,174,BROADWAY,11211,7183024427,american,2022-02-16T00:00:00.000,Violations were cited in the following area(s).,...,2024-04-17T06:00:13.000,Cycle Inspection / Initial Inspection,40.710010026465,-73.962602595341,301,34,54900,3059532,3021320020,BK73


# Articles

NewsAPI only permits requests going back one month, so calculate the date one month ago.

In [8]:
#pip install python-dateutil

In [9]:
import datetime
from dateutil.relativedelta import relativedelta

# Get one month ago's date
today_date = datetime.date.today()
one_month_ago = today_date - relativedelta(months=1)
one_month_ago_str = one_month_ago.strftime('%Y-%m-%d')
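Subtracting a month is not well defined when the previous month is shorter, for example one month before March 31. As far as I know, `relativedelta` handles this by clamping to the last valid day of the target month. The standard-library sketch below shows the same clamping explicitly. `one_month_before` is a name made up for this illustration.

```python
import calendar
import datetime


def one_month_before(day):
    """Return the same day of the previous month, clamped to that
    month's last day when the previous month is shorter."""
    if day.month == 1:
        year, month = day.year - 1, 12
    else:
        year, month = day.year, day.month - 1
    last_day = calendar.monthrange(year, month)[1]
    return day.replace(year=year, month=month, day=min(day.day, last_day))


print(one_month_before(datetime.date(2024, 3, 31)))  # 2024-02-29
print(one_month_before(datetime.date(2024, 1, 15)))  # 2023-12-15
```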

In [10]:
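import os
import requests

# Keywords for now: dining, cuisine, restaurant, recipe.
# Add more or modify to improve search results.
params = {
    'q': '(dining OR cuisine OR restaurant OR recipe)',
    'from': one_month_ago_str,
    # Read the key from the environment rather than hard-coding it;
    # NEWSAPI_KEY is an assumed variable name, set it before running
    'apiKey': os.environ['NEWSAPI_KEY'],
}

# requests URL-encodes the query parameters
response = requests.get('https://newsapi.org/v2/everything', params=params)
data = response.json()
articles = data.get('articles', [])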

In [11]:
data

{'status': 'ok',
 'totalResults': 20911,
 'articles': [{'source': {'id': 'fox-news', 'name': 'Fox News'},
   'author': 'Kurt Knutsson, CyberGuy Report',
   'title': 'Restaurant combines an amusement ride with unforgettable fine dining',
   'description': 'Eatrenalin is a restaurant in Germany with a Floating Chair innovation that makes the 17,000-square-foot venue feel like an amusement park.',
   'url': 'https://www.foxnews.com/tech/restaurant-combines-amusement-ride-with-unforgettable-fine-dining',
   'urlToImage': 'https://static.foxnews.com/foxnews.com/content/uploads/2024/04/2-The-roaming-restaurant-experience-that-lets-you-go-from-room-to-room-without-ever-leaving-your-seat.jpg',
   'publishedAt': '2024-04-10T10:00:16Z',
   'content': "Ready for an amazing restaurant experience that'll take your taste buds on a wild ride as you move from one incredible room to the next? Sounds like an amusement park experience, right?\r\nIf you thoug… [+3983 chars]"},
  {'source': {'id': 'time', 
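The response above carries a `status` field alongside `articles`. When a request fails, NewsAPI reports `status: "error"` with a `code` and a `message`, and `data.get('articles', [])` would quietly yield an empty list. A minimal sketch of surfacing that case is shown below. `extract_articles` is a hypothetical helper, and the error message text is invented for the example.

```python
def extract_articles(payload):
    """Return the article list, or raise if the API reported an error."""
    if payload.get("status") != "ok":
        raise RuntimeError(
            f"NewsAPI error {payload.get('code')}: {payload.get('message')}"
        )
    return payload.get("articles", [])


# Trimmed-down payloads shaped like the response above
ok_payload = {
    "status": "ok",
    "totalResults": 1,
    "articles": [
        {"title": "Restaurant combines an amusement ride with unforgettable fine dining"}
    ],
}
error_payload = {
    "status": "error",
    "code": "apiKeyInvalid",
    "message": "example message",  # invented text, for illustration only
}

print(len(extract_articles(ok_payload)))
try:
    extract_articles(error_payload)
except RuntimeError as exc:
    print(exc)
```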

# Load data into MongoDB

In [12]:
from pymongo import MongoClient
client = MongoClient('localhost',27017)
db = client.proj5400
collection_articles = db.articles_data 
collection_restaurants = db.restaurants_data

In [13]:
# Insert articles data into MongoDB
if articles:
    collection_articles.insert_many(articles)
else:
    print("No data to insert.")

In [14]:
# Insert restaurants data into MongoDB
restaurants_dict = restaurants_cleaned.to_dict('records')
collection_restaurants.insert_many(restaurants_dict)

InsertManyResult([ObjectId('662152e9485ce7a81bfe4b3f'), ObjectId('662152e9485ce7a81bfe4b40'), ObjectId('662152e9485ce7a81bfe4b41'), ObjectId('662152e9485ce7a81bfe4b42'), ObjectId('662152e9485ce7a81bfe4b43'), ObjectId('662152e9485ce7a81bfe4b44'), ObjectId('662152e9485ce7a81bfe4b45'), ObjectId('662152e9485ce7a81bfe4b46'), ObjectId('662152e9485ce7a81bfe4b47'), ObjectId('662152e9485ce7a81bfe4b48'), ObjectId('662152e9485ce7a81bfe4b49'), ObjectId('662152e9485ce7a81bfe4b4a'), ObjectId('662152e9485ce7a81bfe4b4b'), ObjectId('662152e9485ce7a81bfe4b4c'), ObjectId('662152e9485ce7a81bfe4b4d'), ObjectId('662152e9485ce7a81bfe4b4e'), ObjectId('662152e9485ce7a81bfe4b4f'), ObjectId('662152e9485ce7a81bfe4b50'), ObjectId('662152e9485ce7a81bfe4b51'), ObjectId('662152e9485ce7a81bfe4b52'), ObjectId('662152e9485ce7a81bfe4b53'), ObjectId('662152e9485ce7a81bfe4b54'), ObjectId('662152e9485ce7a81bfe4b55'), ObjectId('662152e9485ce7a81bfe4b56'), ObjectId('662152e9485ce7a81bfe4b57'), ObjectId('662152e9485ce7a81bfe4b
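Because `insert_many` always adds new documents, rerunning the two insert cells stores every article and restaurant a second time. One possible alternative is to upsert on a natural key, sketched below under the assumption that `url` identifies an article and `dba` identifies a cleaned restaurant row. `upsert_many` and `FakeCollection` are hypothetical; the fake only imitates the `update_one(filter, update, upsert=True)` call so the idea can be tried without a running MongoDB.

```python
def upsert_many(collection, docs, key):
    """Insert or update each document, matched on `key`, so that
    rerunning the load does not create duplicates."""
    for doc in docs:
        # insert_many adds an _id to each dict in place; leave it out of $set
        fields = {k: v for k, v in doc.items() if k != "_id"}
        collection.update_one({key: fields[key]}, {"$set": fields}, upsert=True)


class FakeCollection:
    """In-memory stand-in that imitates update_one with upsert for this demo."""

    def __init__(self):
        self.docs = {}

    def update_one(self, filter, update, upsert=False):
        doc_key = tuple(filter.items())
        if doc_key in self.docs or upsert:
            self.docs.setdefault(doc_key, {}).update(update["$set"])


fake = FakeCollection()
docs = [
    {"url": "https://example.com/a", "title": "A"},
    {"url": "https://example.com/b", "title": "B"},
]
upsert_many(fake, docs, key="url")
upsert_many(fake, docs, key="url")  # rerun: no new documents
print(len(fake.docs))
```

Against the real collections the calls would look like `upsert_many(collection_articles, articles, key="url")`. One `update_one` per document is simple but slow for about 20,000 restaurants, so a batched write may be a better fit there.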

# Flask input and output

In [15]:
from flask import Flask, request, render_template
app = Flask("Interactive App")

@app.route('/', methods=['GET'])
def my_form():
    return render_template("search_form.html")

@app.route('/', methods=['POST'])
def search_articles():
    search_term = request.form['search_term']
    filter_term = request.form['filter_term']  

    article_query = {
        '$or': [
            {'title': {'$regex': search_term, '$options': 'i'}},
            {'description': {'$regex': search_term, '$options': 'i'}},
            {'content': {'$regex': search_term, '$options': 'i'}}
        ]
    }
    results = collection_articles.find(article_query)
    articles = list(results)
    
    restaurant_query = {
        '$and':[
            {'cuisine_description': {'$regex': search_term, '$options': 'i'}},  # stored values are lowercase
            {'zipcode': {'$regex': filter_term}}
        ]
    }
    results2 = collection_restaurants.find(restaurant_query)
    restaurants = list(results2)
    
    return render_template('results.html', articles=articles, restaurants=restaurants)
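The route above passes the form input straight into `$regex`, so characters such as `+` or `(` are interpreted as regex syntax. The sketch below factors the two queries into plain functions that escape the input first, which also makes them testable without Flask or MongoDB. `build_article_query` and `build_restaurant_query` are hypothetical names, and anchoring the zipcode with `^` (prefix match) is a choice made for this example that differs from the substring match in the route.

```python
import re


def build_article_query(search_term):
    """Case-insensitive literal match on title, description, or content."""
    pattern = re.escape(search_term)
    return {
        "$or": [
            {field: {"$regex": pattern, "$options": "i"}}
            for field in ("title", "description", "content")
        ]
    }


def build_restaurant_query(cuisine, zipcode):
    """Literal cuisine match plus a zipcode prefix match."""
    return {
        "$and": [
            {"cuisine_description": {"$regex": re.escape(cuisine), "$options": "i"}},
            # Assumption for this sketch: match zipcodes that start with the input
            {"zipcode": {"$regex": "^" + re.escape(zipcode)}},
        ]
    }


query = build_restaurant_query("c++", "100")
print(query["$and"][0]["cuisine_description"]["$regex"])
print(query["$and"][1]["zipcode"]["$regex"])
```

Inside the route, these would be used as `collection_articles.find(build_article_query(search_term))` and `collection_restaurants.find(build_restaurant_query(search_term, filter_term))`.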

In [16]:
app.run(host='localhost', port=5002)

 * Serving Flask app 'Interactive App'
 * Debug mode: off


 * Running on http://localhost:5002
INFO:werkzeug:Press CTRL+C to quit
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 13:06:00] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 13:06:38] "POST / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 13:06:41] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 15:41:24] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 15:42:03] "POST / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 15:42:16] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 16:00:43] "POST / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 16:00:48] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 16:00:52] "POST / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 16:01:00] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 16:01:03] "POST / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - - [18/Apr/2024 16:01:09] "GET / HTTP/1.1" 200 -
INFO:werkzeug:127.0.0.1 - -