# DataStorage Assignment
#### Mohammad Shafi Nikzada - 3754101
#### Nils Jesper Cornelius - 3754102

### Link to GitHub
https://github.com/NikzadaShafi/Datastorage/blob/main/API_Kaggle_and_DataStorage.ipynb

## Kaggle API and Downloading the data

In [55]:
import pandas as pd
import opendatasets as od
import psycopg2

In [56]:
# Airbnb-property-data-from-texas
od.download("https://www.kaggle.com/datasets/PromptCloudHQ/airbnb-property-data-from-texas")

Skipping, found downloaded files in ".\airbnb-property-data-from-texas" (use force=True to force download)


In [57]:
# The data is in CSV format; read it with the pandas library
data = pd.read_csv('airbnb-property-data-from-texas/Airbnb_Texas_Rentals.csv')
# Drop the first column, which is not part of the table we want to store
data = data.drop(data.columns[[0]], axis=1)
print(data.shape)

(18259, 9)


In [None]:
# Show the first two rows of the data
data.head(2)

### How should our data be inserted into PostgreSQL?

In [58]:
# Target column types for the PostgreSQL table:
#     average_rate_per_night          varchar,
#     bedrooms_count                  varchar,
#     city                            varchar,
#     date_of_listing                 varchar,
#     description                     varchar,
#     latitude                        float,
#     longitude                       float,
#     title                           varchar,
#     url                             varchar

In [59]:
# Types of data that we have in our CSV file
data.dtypes

average_rate_per_night     object
bedrooms_count             object
city                       object
date_of_listing            object
description                object
latitude                  float64
longitude                 float64
title                      object
url                        object
dtype: object

In [60]:
# To insert the data into PostgreSQL, map the pandas dtypes to PostgreSQL column types
replacements = {
    'object': 'varchar',
    'float64': 'float'
}
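The mapping above can be turned into the `create table` statement used later in the notebook. The following is a minimal sketch, not the notebook's own method: `build_create_table` and `sample_dtypes` are hypothetical names, and the dtypes are written out by hand so the example runs without pandas.

```python
# Mapping from pandas dtype names to PostgreSQL column types (from the notebook)
replacements = {"object": "varchar", "float64": "float"}


def build_create_table(table_name, dtypes):
    """Return a CREATE TABLE statement for a {column: pandas dtype name} mapping."""
    columns = ", ".join(
        f"{column} {replacements[dtype]}" for column, dtype in dtypes.items()
    )
    return f"create table {table_name} ({columns})"


# Hypothetical subset of the dataframe's dtypes, written out by hand
sample_dtypes = {"city": "object", "latitude": "float64", "longitude": "float64"}
statement = build_create_table("airbnb_texas_rentals", sample_dtypes)
print(statement)
```

With the real dataframe, `data.dtypes.astype(str).to_dict()` should supply the mapping, assuming the dtype names are the ones printed above.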

# -----------------------------------------      New Section    --------------------------------------------


# Working with PostgreSQL and establishing a connection

In [61]:
# Uncomment if you need to install the library
#!pip install psycopg2

In [62]:
# Forming the connection to PostgreSQL
conn = psycopg2.connect(
    database="DataStorage",
    user="postgres",
    password="341741",
    host="localhost",
    port="5433"
)

cursor = conn.cursor()
print("successful connection")

successful connection
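Hard-coding the password in the notebook exposes it to anyone who can read the file. One possible alternative, sketched here under assumptions, is to read the connection settings from environment variables. `pg_settings` is a hypothetical helper. The variable names are the standard libpq ones, and the defaults are the values used in this notebook.

```python
import os


def pg_settings(env=os.environ):
    """Collect PostgreSQL connection settings from environment variables."""
    return {
        "database": env.get("PGDATABASE", "DataStorage"),
        "user": env.get("PGUSER", "postgres"),
        # No default for the password, so it never has to appear in the notebook
        "password": env.get("PGPASSWORD"),
        "host": env.get("PGHOST", "localhost"),
        "port": env.get("PGPORT", "5433"),
    }


# Example with a hand-written environment (the password value is made up)
settings = pg_settings({"PGPASSWORD": "example-secret"})
print(settings["host"], settings["port"])
```

The resulting dictionary could then be passed on as `psycopg2.connect(**pg_settings())`.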


In [63]:
# Deletes the table if it already exists
cursor.execute("drop table if exists airbnb_texas_rentals;")

In [64]:
# Create the table with columns
cursor.execute("create table airbnb_texas_rentals \
(average_rate_per_night varchar, bedrooms_count varchar, city varchar, date_of_listing varchar, \
description varchar, latitude float, longitude float, title varchar, url varchar)")
conn.commit()

In [65]:
# Write the dataframe to a CSV file and open that file for reading
data.to_csv('airbnb_texas_rentals.csv', header=data.columns, index=False, encoding='utf-8')
my_file = open('airbnb_texas_rentals.csv', 'r', encoding='utf-8')


In [66]:
# Copy the CSV file into the database table
SQL_STATEMENT = """
COPY airbnb_texas_rentals FROM STDIN WITH
    CSV
    HEADER
    DELIMITER AS ','
"""
cursor.copy_expert(sql=SQL_STATEMENT, file=my_file)
print("file copied to db")

file copied to db
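`copy_expert` only needs a file-like object, so the intermediate CSV file on disk is optional. The sketch below builds the same kind of CSV in memory with the standard library. `rows_to_csv_buffer` and the sample row are hypothetical, and the line terminator is set to `\n` explicitly.

```python
import csv
import io


def rows_to_csv_buffer(columns, rows):
    """Write a header and rows into an in-memory CSV buffer, rewound to the start."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    buffer.seek(0)  # rewind so a reader starts at the header line
    return buffer


# Hypothetical sample row; the comma in the description forces quoting
buffer = rows_to_csv_buffer(
    ["city", "description", "latitude"],
    [["Humble", "Private room, queen bed", 30.02]],
)
print(buffer.getvalue())
```

A buffer like this could be passed as `file=buffer` to `cursor.copy_expert`. With the dataframe from this notebook, `data.to_csv(buffer, index=False)` followed by `buffer.seek(0)` should fill the buffer in the same way.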


In [67]:
cursor.execute("grant select on table airbnb_texas_rentals to public")
conn.commit()

print('table airbnb_texas_rentals imported to db')

table airbnb_texas_rentals imported to db


# Using SQL queries to explore the data that we have stored

In [68]:
# SQL code as a string
sql = "SELECT * FROM airbnb_texas_rentals LIMIT 2;"

# Execute the SQL code
cursor.execute(sql)

# Fetch the results
rows = cursor.fetchall()

# Print the results
for row in rows:
    print(row)

# Close the cursor and connection
cursor.close()
conn.close()

('$27', '2', 'Humble', 'May 2016', 'Welcome to stay in private room with queen bed and detached private bathroom on the second floor. Another private bedroom with sofa bed is available for additional guests. 10$ for an additional guest.\\n10min from IAH airport\\nAirport pick-up/drop off is available for $10/trip.', 30.0201379199512, -95.29399600425128, '2 Private rooms/bathroom 10min from IAH airport', 'https://www.airbnb.com/rooms/18520444?location=Cleveland%2C%20TX')
('$149', '4', 'San Antonio', 'November 2010', 'Stylish, fully remodeled home in upscale NW – Alamo Heights Area. \\n\\nAmazing location - House conveniently located in quiet street, with beautiful seasoned trees, prestigious neighborhood and very close to the airport, 281, 410 loop and down-town area. \\n\\nFeaturing an open floor plan, original hardwood floors, 3 bedrooms, 3 FULL bathrooms + an independent garden-TV room which can sleep 2 more\\n\\nEuropean inspired kitchen and “top of the line” decor. Driveway can par

# -----------------------------------------      New Section    --------------------------------------------

# Working with MongoDB and establishing a connection

#### To create a database and a collection in MongoDB using Python, we need the PyMongo library.

In [69]:
# Uncomment if you need to install pymongo
#!pip install pymongo

In [70]:
import pymongo
import opendatasets as od
import json

In [71]:
# Connect to the MongoDB server using the MongoClient() function.
client = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
# Now let's create a database using the client object.
db = client['ZARA_US_fashion_products']

# Create collection
collection = db['mycollection']


### Let's download a JSON dataset from Kaggle and insert it into MongoDB

In [72]:
# Download "ZARA US fashion products dataset"
od.download("https://www.kaggle.com/datasets/crawlfeeds/zara-us-fashion-products-dataset")

Skipping, found downloaded files in ".\zara-us-fashion-products-dataset" (use force=True to force download)


In [73]:
with open("zara-us-fashion-products-dataset/zara_us_sample_data.json", "r") as file:
    data = json.load(file, strict=False)

In [74]:
x = collection.insert_many(data)

### Some interesting queries

In [75]:
x = collection.find_one()
print(x)

{'_id': ObjectId('641e057bd3c089c63b918670'), 'url': 'https://www.zara.com/us/en/satin-effect-corset-bodysuit-p00219805.html', 'language': 'en-US', 'name': 'SATIN EFFECT CORSET BODYSUIT', 'sku': '128666521-966-2', 'mpn': '128666521-966-2', 'brand': 'ZARA', 'description': 'Bodysuit with sweetheart neckline and adjustable spaghetti straps. Bottom snap button closure.', 'price': '29.9', 'currency': 'USD', 'availability': 'InStock', 'condition': 'NewCondition', 'images': 'https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0

In [76]:
# Find documents whose "color" field contains "Mid-blue" surrounded by whitespace
# and limit the result to 2

# Raw string so that \s reaches the regex engine unchanged
myquery = {"color": {"$regex": r"\sMid-blue\s"}}

mydoc = collection.find(myquery).limit(2)

for x in mydoc:
    print(x)

{'_id': ObjectId('641e057bd3c089c63b91867b'), 'url': 'https://www.zara.com/us/en/geometric-print-overshirt-p00840383.html', 'language': 'en-US', 'name': 'GEOMETRIC PRINT OVERSHIRT', 'sku': '142882821-427-2', 'mpn': '142882821-427-2', 'brand': 'ZARA', 'description': 'Relaxed fit overshirt with lapel collar and long sleeves. Patch pockets at chest and hip. Washed effect print. Front button closure.', 'price': '69.9', 'currency': 'USD', 'availability': 'InStock', 'condition': 'NewCondition', 'images': 'https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://static.zara.net/stdstatic/1.234.0-b.45/images/transparent-background.png~https://s
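To see what the pattern in the query above accepts, it can be tried locally with Python's `re` module, which treats `\s` the same way for simple cases like this one. This is only an illustration: the sample strings are made up and are not taken from the dataset.

```python
import re

# Same pattern as in the MongoDB query: "Mid-blue" with whitespace on both sides
pattern = re.compile(r"\sMid-blue\s")

# Hypothetical "color" values
samples = [
    "Blue / Mid-blue / 427",  # whitespace on both sides
    "Mid-blue",               # no surrounding whitespace
    "Light Mid-blue wash",    # whitespace on both sides
]

matches = [bool(pattern.search(sample)) for sample in samples]
print(matches)  # [True, False, True]
```

A value that consists only of `Mid-blue`, or that starts or ends with it, is not matched, because the pattern requires a whitespace character on each side.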

## **Collecting data with an API and storing it in MongoDB**


In [77]:
import requests
import pandas as pd
import pymongo
from pymongo import MongoClient

In [78]:
# Connect to the MongoDB server using the MongoClient() function.
client = pymongo.MongoClient("mongodb://127.0.0.1:27017/")
# Now let's create a 'Games' database using the client object.
db = client['Games']
# Create a new collection named "gameReviews"
reviews_collection = db["gameReviews"]

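# Base URL of the GameSpot reviews endpoint, including the API key and the JSON format
games_base = "http://www.gamespot.com/api/reviews/?api_key=8d7b808247f3bd6e8809ce1fcf29d7bbf943cfbe&format=json"
# The HTTP header that identifies the client is named "User-Agent"
headers = {
    "User-Agent": "Shafinikzada API Access"
}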

In [79]:
# Now send a GET request to the GameSpot API and retrieve the review data
response = requests.get(games_base, headers=headers)
# Parse the JSON data returned from an API request into a Python dictionary
data = response.json()

# Iterate through the review data and insert it into the "gameReviews" collection
for review in data["results"]:
    review_data = {
        "game_title": review["game"]["name"],
        "review_text": review["body"],
        "rating": review["score"]
    }
    reviews_collection.insert_one(review_data)
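The loop above raises a `KeyError` or `TypeError` as soon as one review lacks a field, and it makes one database round trip per review. A more defensive variant might look like the sketch below. `to_review_doc` is a hypothetical helper, and the sample results are invented stand-ins for `data["results"]`.

```python
def to_review_doc(review):
    """Pick the fields we store, using None where the API response lacks a value."""
    game = review.get("game") or {}
    return {
        "game_title": game.get("name"),
        "review_text": review.get("body"),
        "rating": review.get("score"),
    }


# Invented stand-in for data["results"]
sample_results = [
    {"game": {"name": "Example Game"}, "body": "Great.", "score": "8"},
    {"game": None, "body": "No game attached.", "score": "5"},
]

docs = [to_review_doc(review) for review in sample_results]
print(docs)
```

The resulting list could then be stored in one call with `reviews_collection.insert_many(docs)`. Since `insert_many` rejects an empty list, the call should be guarded with `if docs:`.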