# Exercise 4: Data Processing Systems

Your name: Sayeeda Begam Mohamed Ikbal

In this exercise, we will get to know and work with some industry-standard data processing systems.

We will develop a small application on top of the NYTD data set. This includes:
 - Training an ML model based on location, time and weather data to predict how many taxis will be at a certain location in New York
 - Searching for specific landmarks in New York with ElasticSearch
 - Retrieving the coordinates of the landmarks and converting them to the area identifiers used in the NYTD data set
 - Taking the weather forecast into account and predicting how many taxis will be in a certain area



We will also use a bunch of data processing systems. This includes a classical database with a GEO(graphy)-extension loaded (often referred to as GIS: Geographic Information System), and a full-text-search service.

Let's take a look at full-text-search first:

The idea is similar to Google web search: you can search for certain terms in a huge text corpus. This is done by indexing the whole corpus and matching search terms against the index. Fancier operations might be supported, such as fuzzy matching (ignores small typos or other differences), natural language matching (normalizes text and potentially replaces terms with semantically equivalent words) and other cool stuff! Sometimes, even regular expressions or custom tokenizers are supported. Custom tokenizers can be useful for, e.g., IP addresses, as they can appear in different forms that are all semantically equivalent. A popular service for full-text search is ElasticSearch, which is open source.
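
To make fuzzy matching a bit more concrete, here is a minimal sketch using Python's standard-library `difflib`. It only illustrates the idea of tolerating small typos; it is not how ElasticSearch works internally. The small corpus and the `0.8` similarity cutoff are made-up assumptions.

```python
import difflib

def fuzzy_search(term, documents, cutoff=0.8):
    """Return the documents that contain a word similar to `term`."""
    hits = []
    for doc in documents:
        words = doc.lower().split()
        # get_close_matches returns the words whose similarity ratio is >= cutoff
        if difflib.get_close_matches(term.lower(), words, n=1, cutoff=cutoff):
            hits.append(doc)
    return hits

# Made-up mini corpus; the search term contains a typo on purpose
corpus = [
    "Staten Island Savings Bank Building",
    "Decker Farmhouse",
    "Saint Andrew's Church",
]
print(fuzzy_search("farmhuose", corpus))
```

A real search engine avoids scanning every document like this by looking terms up in a prebuilt index.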

Next, a few words about GIS:

Handling geography in databases is difficult, but solves some very important use cases. Therefore, many databases either support geography out-of-the-box or via extensions. Postgres supports it via the PostGIS extension. Let's have a look at why this is difficult:
 - Longitude and latitude are special numbers: longitude is bounded to -180/180 and wraps around, while latitude is bounded to -90/90.
 - When describing areas, there is not one standardized way, but multiple, e.g., WKB (well-known binary), human-readable text-representations (WKT: well-known text) with LINESTRINGs, POLYGONs, MULTIPOLYGONs, GeoJSON and so on...
 - Expressions are usually hard to write in SQL, e.g., "Is the following polygon fully contained inside another polygon?" or "What is the distance (including the earth's curvature) between the two points?". Therefore, GIS extensions usually provide custom functions for these use cases.
 - It's not always clear what the reference system is: the earth is not a perfect sphere, so people came up with different, highly precise reference systems. There are thousands of standardized reference systems, see also https://en.wikipedia.org/wiki/Spatial_reference_system. For example, there might be one that is mapped solely over Nuremberg!
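
Two of the points above can be sketched in a few lines of plain Python: wrapping longitudes and computing a distance that takes the earth's curvature into account. This is only an illustration under the assumption of a perfectly spherical earth with a mean radius of 6371 km, which is exactly the simplification that real reference systems avoid. The function names are hypothetical and the example coordinates are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; the earth is not a perfect sphere

def normalize_longitude(lon):
    """Wrap any longitude into the range [-180, 180)."""
    return (lon + 180.0) % 360.0 - 180.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Roughly Times Square to the Statue of Liberty (approximate coordinates)
print(haversine_km(40.7580, -73.9855, 40.6892, -74.0445))
print(normalize_longitude(190.0))
```

GIS extensions such as PostGIS provide tested functions for these computations, so hand-written versions like this one are mainly useful for understanding what happens under the hood.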

As you can see, it's not trivial. Therefore, we will work with it.


## Task 1: ElasticSearch / OpenSearch

For the first part, we want to index a data set that contains landmark information about New York with ElasticSearch. We have already prepared the data set for you and brought it into a nice JSON format. However, it should be noted that the OCR from the PDF files was not perfect, and therefore there are many typos in the description. But as we've learned earlier, ElasticSearch (ES) should be able to deal with that!

In this part of the exercise, you'll learn:
- How to use ES
- How to index data with ES
- How to query data with ES

Task:
 1. Download the data set containing landmark information from the course website.
 2. Connect to our instance of ElasticSearch.
 3. Create an index. Include your name in the index identifier to not cause collisions.
 4. Upload the landmark data set to your index.
 5. Create a function that accepts a list of keywords, searches for the keywords in `title` and `description`, and returns the name and area of the results.
 6. Run test queries against the indexed data.


Test queries:
 - Search for company buildings in the neo-classical architectural style (hint: we are looking for a bank)
 - Buildings where windows are set behind handsome ornamental iron grilles

Connection config:
```
host = 'dep-eng-data-s-heimgarten.hosts.utn.de'
port = 9200
auth = ('data-eng-elasticsearch', 'zTBix#e55:33')  # username and password
```
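
Since the same credentials are needed in several cells, one option is to read them from environment variables instead of repeating them in the notebook. The sketch below assumes hypothetical variable names (`ES_HOST`, `ES_PORT`, `ES_USER`, `ES_PASSWORD`) that are not part of the exercise; the defaults are placeholders.

```python
import os

def load_connection_config(env=os.environ):
    """Read host, port and credentials from a mapping of environment variables."""
    host = env.get("ES_HOST", "localhost")          # hypothetical variable name
    port = int(env.get("ES_PORT", "9200"))          # hypothetical variable name
    auth = (env.get("ES_USER", ""), env.get("ES_PASSWORD", ""))
    return host, port, auth

# Passing an explicit dictionary makes the behaviour easy to try out
print(load_connection_config({"ES_HOST": "example.org", "ES_PORT": "9200"}))
```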

Note:
If you encounter the error
> /Users/kipf/Library/Python/3.9/lib/python/site-packages/urllib3/__init__.py:34: NotOpenSSLWarning: urllib3 v2 only supports OpenSSL 1.1.1+, currently the 'ssl' module is compiled with 'LibreSSL 2.8.3'. See: https://github.com/urllib3/urllib3/issues/3020

you can fix it by running `pip install urllib3==1.26.6`

In [1]:
!pip install opensearch-py

from opensearchpy import OpenSearch

print("Library imported successfully!")


Collecting opensearch-py
  Using cached opensearch_py-2.8.0-py3-none-any.whl.metadata (6.9 kB)
Collecting urllib3!=2.2.0,!=2.2.1,<3,>=1.26.19 (from opensearch-py)
  Using cached urllib3-2.2.3-py3-none-any.whl.metadata (6.5 kB)
Collecting requests<3.0.0,>=2.32.0 (from opensearch-py)
  Using cached requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting certifi>=2024.07.04 (from opensearch-py)
  Using cached certifi-2024.8.30-py3-none-any.whl.metadata (2.2 kB)
Collecting Events (from opensearch-py)
  Using cached Events-0.5-py3-none-any.whl.metadata (3.9 kB)
Collecting charset-normalizer<4,>=2 (from requests<3.0.0,>=2.32.0->opensearch-py)
  Using cached charset_normalizer-3.4.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (34 kB)
Collecting idna<4,>=2.5 (from requests<3.0.0,>=2.32.0->opensearch-py)
  Using cached idna-3.10-py3-none-any.whl.metadata (10 kB)
Using cached opensearch_py-2.8.0-py3-none-any.whl (353 kB)
Using cached certifi-2024.8.30-py3-none-any.w

In [33]:
from opensearchpy import OpenSearch

########
## You need to be in the university's VPN to access the server
########

host = 'dep-eng-data-s-heimgarten.hosts.utn.de'
port = 9200
auth = ('data-eng-elasticsearch', 'zTBix#e55:33')

client = OpenSearch(
    hosts = [{'host': host, 'port': port}],
    http_compress = True,
    http_auth = auth,
    use_ssl = True,
    verify_certs = False,
    ssl_assert_hostname = False,
    ssl_show_warn = False,
)
if client.ping():
    print("Successfully connected to Elasticsearch!")
else:
    print("Failed to connect to Elasticsearch.")

Successfully connected to Elasticsearch!


In [34]:
index_name = "landmarks_sayeeda"  # Replace "sayeeda" with your name

# Define index settings and mappings (optional)
index_settings = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "description": {"type": "text"},
            "area": {"type": "geo_shape"},  # Assuming area is a geographic shape
            "designation_date": {"type": "date"},
        }
    }
}

# Create the index
response = client.indices.create(index=index_name, body=index_settings, ignore=400)
if response.get("acknowledged", False):
    print(f"Index '{index_name}' created successfully!")
else:
    print(f"Index '{index_name}' already exists.")


Index 'landmarks_sayeeda' already exists.


In [35]:
!pip install elasticsearch




In [44]:
import json

# Path to the JSON file
data_file_path = "/home/sayeedabegam/utn/sem1/Data_Eng/Asmt_4/Asmt_4/ny_landmarks-1.json"  # Replace with your actual file path

landmarks_data = []
with open(data_file_path, 'r') as file:
    for line in file:
        try:
            # Parse each line as a JSON object
            landmarks_data.append(json.loads(line.strip()))
        except json.JSONDecodeError as e:
            print(f"Error parsing line: {e}")

print(f"Successfully loaded {len(landmarks_data)} landmarks.")
# Index the data into Elasticsearch
for landmark in landmarks_data:
    try:
        response = client.index(index=index_name, body=landmark)
        print(f"Indexed landmark: {landmark['name']}")
    except Exception as e:
        print(f"Error indexing landmark {landmark.get('name', 'Unnamed')}: {e}")


Successfully loaded 1528 landmarks.
Indexed landmark: 105 Franklin Avenue House
Indexed landmark: Decker Farmhouse
Indexed landmark: Public School 15 (Daniel D. Tompkins School)
Indexed landmark: 121 Heberton Avenue House
Indexed landmark: Reverend David Moore House
Indexed landmark: 752 Delafield Avenue
Indexed landmark: Mary and David Burgher House
Indexed landmark: Staten Island Borough Hall
Indexed landmark: Nathaniel J. and Ann C. Wyeth House
Indexed landmark: 22 Pendleton Place House
Indexed landmark: 364 Van Duzer Street House
Indexed landmark: Staten Island Family Courthouse
Indexed landmark: John DeGroot House
Indexed landmark: Saint Andrew's Church
Indexed landmark: Stephen D. Barnes House
Indexed landmark: 66 Harvard Avenue House
Indexed landmark: John King Vanderbilt House
Indexed landmark: New Dorp Light, Expanded Site
Indexed landmark: 390 Van Duzer Street House
Indexed landmark: 411 Westervelt Avenue House, Horton's Row
Indexed landmark: H. H. Richardson House
Indexed la
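
Indexing one document per request works, but it sends 1528 separate requests. A possible alternative is the bulk helper from `opensearch-py`. The sketch below is an assumption about how this could look for the landmark data: `build_bulk_actions` and `upload_in_bulk` are hypothetical helper names, and the sample document is made up.

```python
def build_bulk_actions(index_name, documents):
    """Yield one bulk action per document."""
    for doc in documents:
        yield {"_index": index_name, "_source": doc}

def upload_in_bulk(client, index_name, documents):
    """Send the documents in batched requests instead of one request each."""
    # Imported lazily so that the action builder can be tried without the library
    from opensearchpy import helpers
    return helpers.bulk(client, build_bulk_actions(index_name, documents))

# Made-up sample document, only to show the shape of a bulk action
sample = [{"name": "Decker Farmhouse", "description": "A farmhouse."}]
print(list(build_bulk_actions("landmarks_demo", sample)))
```

With a connected client, `upload_in_bulk(client, index_name, landmarks_data)` would replace the loop above.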

In [45]:
def find_landmark(search_terms):
    query = {
        "query": {
            "bool": {
                "should": [
                    {"match": {"title": term}} for term in search_terms
                ] + [
                    {"match": {"description": term}} for term in search_terms
                ]
            }
        }
    }
    
    try:
        # Perform the search
        response = client.search(index=index_name, body=query)
        
        # Extract results
        results = []
        for hit in response['hits']['hits']:
            result = {
                'name': hit['_source'].get('name'),
                'area': hit['_source'].get('area')
            }
            results.append(result)
        
        return results
    except Exception as e:
        print(f"Error during search: {e}")
        return []
# Test queries
query1 = ["neo-classical bank"]  # Test query for company buildings in neo-classical style
query2 = ["iron grilles"]  # Test query for buildings with windows set behind ornamental iron grilles

# Execute and print test queries
print(json.dumps(find_landmark(query1), indent=2))
print(json.dumps(find_landmark(query2), indent=2))


[
  {
    "name": "American Bank Note Company Office Building",
    "area": "MULTIPOLYGON (((-74.01188537727334 40.704738501398595, -74.01188886515821 40.704888379476685, -74.01163039371968 40.70487440022729, -74.0116360918578 40.704741869567805, -74.01188537727334 40.704738501398595)))"
  },
  {
    "name": "Williamsburg Branch, Public National Bank of New York Building",
    "area": "MULTIPOLYGON (((-73.94301408986941 40.70307688581724, -73.94303829114857 40.70322526884037, -73.94266424366441 40.70325998966822, -73.94264021036126 40.70311261612427, -73.94301408986941 40.70307688581724)))"
  },
  {
    "name": "Staten Island Savings Bank Building",
    "area": "MULTIPOLYGON (((-74.07719689193364 40.62711616106281, -74.07722754700892 40.627284251372714, -74.07731610878672 40.627298452487, -74.0774071885792 40.62731305797228, -74.07749329274726 40.62732686617684, -74.07729186047735 40.62754146693617, -74.07662306615376 40.62717900680216, -74.07719689193364 40.62711616106281)))"
  },
  {
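
Because the descriptions contain OCR typos, it may help to let the search tolerate small spelling differences. The sketch below builds a `multi_match` query body with `fuzziness` set to `AUTO`. Searching the `name` field is an assumption based on the results shown above (the task text mentions `title`), and `build_landmark_query` is a hypothetical helper name.

```python
def build_landmark_query(search_terms, size=10):
    """Build a query body that matches the keywords against name and description."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": " ".join(search_terms),
                "fields": ["name", "description"],  # assumed field names
                "fuzziness": "AUTO",  # tolerate small typos, e.g. from OCR
            }
        },
    }

print(build_landmark_query(["iron", "grilles"]))
```

The resulting dictionary can be passed as `body` to `client.search`, in the same way as the query in `find_landmark`.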

## Task 2: GIS

After we have successfully indexed the landmarks and retrieved their coordinates, we need to map them to the location IDs used in the NYTD data set. For this, we have already set up a Postgres instance with the PostGIS extension and the data set that contains the mapping information loaded into a table.
In this part of the exercise, you'll learn:
 - How to connect to a remote Postgres instance
 - How to work with GEO data

Task:
 1. Connect to the Postgres instance. We've loaded the mapping data into the `taxizone_mapping` table inside the `public` schema of the `nytd` database (if the `public` schema is used, no explicit schema information has to be set when querying the table).
 2. Write a function that returns the location identifier of the area in New York where a certain geometric area is located. The multipolygon is represented in WKT (well-known-text) format and is passed as an argument to the function.

You can use the following multipolygon for testing:
```
'MULTIPOLYGON (((-73.96048024594971 40.805280360986465, -73.9600849906699 40.805114117054046, -73.96018339911724 40.804834550789785, -73.96075335968756 40.805074275249865, -73.96059295478436 40.80529668328362, -73.96062768938803 40.80531129223246, -73.96061047845258 40.805335135155126, -73.96048024594971 40.805280360986465)))'
```
The expected result is: `166, 'Morningside Heights'`

Note: The geometry definitions all use SRID=4326, which is the reference system defined as latitudes/longitudes over the earth's surface.
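
To get a feeling for what a containment check involves, here is a small plain-Python sketch that parses the first ring of a WKT polygon and tests whether a point lies inside it using ray casting. It is a simplification for illustration only: it treats coordinates as planar, ignores holes and additional polygons, and the function names are hypothetical. PostGIS handles these cases properly.

```python
import re

def parse_outer_ring(wkt):
    """Extract the first ring of a WKT (MULTI)POLYGON as (lon, lat) tuples."""
    ring = re.search(r"\(([^()]+)\)", wkt).group(1)
    return [tuple(float(v) for v in pair.split()) for pair in ring.split(",")]

def point_in_ring(lon, lat, ring):
    """Ray casting: count crossings of a horizontal ray starting at the point."""
    inside = False
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        if (y1 > lat) != (y2 > lat):  # edge spans the ray's latitude
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# Made-up square, only to show the behaviour
square = parse_outer_ring("POLYGON ((0 0, 4 0, 4 4, 0 4, 0 0))")
print(point_in_ring(2, 2, square), point_in_ring(5, 2, square))
```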

In [46]:
!pip install psycopg2-binary


Collecting psycopg2-binary
  Using cached psycopg2_binary-2.9.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Using cached psycopg2_binary-2.9.10-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB)
Installing collected packages: psycopg2-binary
Successfully installed psycopg2-binary-2.9.10


In [14]:
import psycopg2

# Establish connection to the PostgreSQL server (using 'postgres' as the default database)
conn = psycopg2.connect(
    user='data-eng-postgres',
    password='zTBix#e55:33',
    database='postgres',  # Use default 'postgres' database to list databases
    host='dep-eng-data-s-heimgarten.hosts.utn.de',
)

# Create a cursor object
cursor = conn.cursor()

try:
    # Query to list all databases
    cursor.execute("SELECT datname FROM pg_database;")
    
    # Fetch all the results
    databases = cursor.fetchall()

    print("Databases:")
    for db in databases:
        print(db[0])

except Exception as e:
    print(f"Error: {e}")

finally:
    # Close the cursor and connection
    cursor.close()
    conn.close()


import psycopg2

# Establish connection to the 'nytd' database
conn = psycopg2.connect(
    user='data-eng-postgres',
    password='zTBix#e55:33',
    database='nytd',  # Use the 'nytd' database
    host='dep-eng-data-s-heimgarten.hosts.utn.de',
)

# Create a cursor object
cursor = conn.cursor()

try:
    # Query the information schema for column names of the 'taxizone_mapping' table
    cursor.execute("""
        SELECT column_name
        FROM information_schema.columns
        WHERE table_name = 'taxizone_mapping'
        AND table_schema = 'public';
    """)

    # Fetch and print the results
    columns = cursor.fetchall()
    print("Columns in 'taxizone_mapping' table:")
    for column in columns:
        print(column[0])

except Exception as e:
    print(f"Error: {e}")

finally:
    # Close the cursor and connection
    cursor.close()
    conn.close()


Databases:
postgres
nytd
template1
template0
Columns in 'taxizone_mapping' table:
location_id
zone
geom


In [15]:
import psycopg2

# Establish connection to the database
conn = psycopg2.connect(
    user='data-eng-postgres',
    password='zTBix#e55:33',
    database='nytd',  # Use the 'nytd' database
    host='dep-eng-data-s-heimgarten.hosts.utn.de',
)

def find_area(conn, wkt):
    # Define the query to find the area for the provided WKT multipolygon
    query = """
    SELECT location_id, zone, geom
    FROM public.taxizone_mapping
    WHERE ST_Within(ST_GeomFromText(%s, 4326), geom);
    """
    
    # Create a cursor object
    cursor = conn.cursor()
    
    try:
        # Execute the query with the WKT as parameter
        cursor.execute(query, (wkt,))
        
        # Fetch the result
        result = cursor.fetchone()
        
        if result:
            # If a match is found, print the fields and return the identifier and zone name
            print(f"Location ID: {result[0]}")
            print(f"Zone Name: {result[1]}")
            print(f"Geometry (geom): {result[2]}")
            return result[0], result[1]
        else:
            print("No matching area found.")
            return None
    
    except Exception as e:
        print(f"Error during the query: {e}")
    
    finally:
        # Close the cursor
        cursor.close()

# Test the function with the provided multipolygon WKT
wkt = 'MULTIPOLYGON (((-73.96048024594971 40.805280360986465, ' \
      '-73.9600849906699 40.805114117054046, ' \
      '-73.96018339911724 40.804834550789785, ' \
      '-73.96075335968756 40.805074275249865, ' \
      '-73.96059295478436 40.80529668328362, ' \
      '-73.96062768938803 40.80531129223246, ' \
      '-73.96061047845258 40.805335135155126, ' \
      '-73.96048024594971 40.805280360986465)))'

find_area(conn, wkt)

# Close the connection when done
conn.close()


Location ID: 166
Zone Name: Morningside Heights
Geometry (geom): 0106000020E6100000010000000103000000010000006F0000003EB0E3BF407D52C05F99B7EA3A68444057091687337D52C0273108AC1C684440BB63B14D2A7D52C0081EDFDE35684440A182C30B227D52C0D7C39789226844400A2FC1A90F7D52C0F29881CAF867444002D369DD067D52C090D78349F16744404A5F0839EF7C52C0B8C83D5DDD674440B1DD3D40F77C52C0E8DEC325C767444048E00F3FFF7C52C0C9570229B1674440DBC1887D027D52C09DD7D825AA6744400890A163077D52C09EEE3CF19C6744409770E82D1E7D52C02F51BD35B067444082E7DEC3257D52C08DB7955E9B67444042B115342D7D52C03E78EDD2866744405CACA8C1347D52C04E417E36726744408EE9094B3C7D52C00002D6AA5D674440A8E49CD8437D52C0404F030649674440392A37514B7D52C09207228B346744406A6798DA527D52C0BB6070CD1D6744409F8EC70C547D52C011514CDE0067444076FBAC32537D52C0AEB9A3FFE56644402E1B9DF3537D52C0569C6A2DCC66444054C4E9245B7D52C0B4024356B76644409E4319AA627D52C0363FFED2A2664440172D40DB6A7D52C0417FA1478C664440EDB94C4D827D52C066F7E461A1664440F3380CE6AF7D52C046CD57C9C766444059DAA9B9DC7D52C00F7

## Task 3: A Basic ML Model

Let's try to predict how many taxis we will have on a certain date for a location id given the weather forecast. For the weather forecast, we have uploaded some simplified and aggregated data into the Postgres instance that we also used for the location mapping. You can find the data in the `daily_weather` table inside the `public` schema of the `nytd` database. Further, we have uploaded an aggregated version of the New York taxi data set into the `Yellow_Taxi_Trip_Data_2021_daily_rides` table in the same schema. The model doesn't need to be perfect.

Task:
 1. Connect to the Postgres instance and inspect the `Yellow_Taxi_Trip_Data_2021_daily_rides` and `daily_weather` tables.
 2. Prepare the training data by joining the two tables.
 3. Split the data into a training and test part. You can do this randomly.
 4. Train a linear regression model that predicts the number of cabs in an area based on weather conditions.
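
Steps 3 and 4 can be sketched without any libraries to show what the split, the fit and the error metric compute. This is an illustration with a single feature and made-up data, not the model used in this exercise; the function names are hypothetical.

```python
import math
import random

def random_split(rows, test_fraction=0.2, seed=42):
    """Shuffle the rows and split them into a training and a test part."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept for (x, y) pairs."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def rmse(rows, slope, intercept):
    """Root mean squared error of the fitted line on the given rows."""
    return math.sqrt(sum((y - (slope * x + intercept)) ** 2 for x, y in rows) / len(rows))

# Made-up (temperature, num_cabs) pairs following num_cabs = 10 * temperature + 300
data = [(float(t), 10.0 * t + 300.0) for t in range(20)]
train, test = random_split(data)
slope, intercept = fit_line(train)
print(slope, intercept, rmse(test, slope, intercept))
```

Evaluating on the held-out test part, rather than on the training part, gives a more honest picture of how the model behaves on unseen days.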

In [51]:
pip install scikit-learn


Collecting scikit-learn
  Using cached scikit_learn-1.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting numpy>=1.19.5 (from scikit-learn)
  Downloading numpy-2.2.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (62 kB)
Collecting scipy>=1.6.0 (from scikit-learn)
  Using cached scipy-1.14.1-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib>=1.2.0 (from scikit-learn)
  Using cached joblib-1.4.2-py3-none-any.whl.metadata (5.4 kB)
Collecting threadpoolctl>=3.1.0 (from scikit-learn)
  Using cached threadpoolctl-3.5.0-py3-none-any.whl.metadata (13 kB)
Using cached scikit_learn-1.6.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.1 MB)
Using cached joblib-1.4.2-py3-none-any.whl (301 kB)
Downloading numpy-2.2.0-cp312-cp312-manylinux

In [22]:
!pip install pandas
!pip install sqlalchemy


Collecting sqlalchemy
  Using cached SQLAlchemy-2.0.36-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.7 kB)
Collecting typing-extensions>=4.6.0 (from sqlalchemy)
  Using cached typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)
Collecting greenlet!=0.4.17 (from sqlalchemy)
  Using cached greenlet-3.1.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (3.8 kB)
Using cached SQLAlchemy-2.0.36-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.2 MB)
Using cached greenlet-3.1.1-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl (613 kB)
Using cached typing_extensions-4.12.2-py3-none-any.whl (37 kB)
Installing collected packages: typing-extensions, greenlet, sqlalchemy
Successfully installed greenlet-3.1.1 sqlalchemy-2.0.36 typing-extensions-4.12.2


In [35]:
import psycopg2
import pandas as pd

# Connect to the database
conn = psycopg2.connect(
    user='data-eng-postgres',
    password='zTBix#e55:33',
    database='nytd',
    host='dep-eng-data-s-heimgarten.hosts.utn.de',
)

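# Fetch the first few rows of one table at a time to inspect its columns.
# Only the last assignment to `query` takes effect, so swap the two lines
# below to inspect the rides table instead of the weather table.
query = "SELECT * FROM public.Yellow_Taxi_Trip_Data_2021_daily_rides LIMIT 10;"
query = "SELECT * FROM public.daily_weather LIMIT 10;"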

df = pd.read_sql_query(query, conn)

# Print out column names
print("Column names:", df.columns)

# Close the connection
conn.close()


Column names: Index(['date', 'precipitation_in_mm', 'temperature_in_c'], dtype='object')




In [40]:
import psycopg2
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from datetime import timedelta
from sqlalchemy import create_engine
import numpy as np

# Establish connection to the database
conn = psycopg2.connect(
    user='data-eng-postgres',
    password='zTBix#e55:33',
    database='nytd',
    host='dep-eng-data-s-heimgarten.hosts.utn.de',
)

# Query to join Yellow_Taxi_Trip_Data_2021_daily_rides with daily_weather
query = """
SELECT
    t.date,
    t.location_id,
    t.num_cabs,
    w.temperature_in_c AS temperature,
    w.precipitation_in_mm AS precipitation
FROM
    public.Yellow_Taxi_Trip_Data_2021_daily_rides t
JOIN
    public.daily_weather w
ON
    t.date = w.date
"""

# Read data into DataFrame
df = pd.read_sql_query(query, conn)

# Close the connection
conn.close()

# Check the first few rows of the data
print(df.head())

# Handle missing values if any
df = df.dropna()

# Prepare feature columns and target variable
X = df[['temperature', 'precipitation']]  # Features (weather conditions)
y = df['num_cabs']  # Target variable (number of cabs)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the Linear Regression model
def train_model(X_train, y_train):
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model

model = train_model(X_train, y_train)

# Evaluate the model using the test data
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)  # Compute MSE
rmse = np.sqrt(mse)  # Compute RMSE by taking the square root of MSE

print(f'Root Mean Squared Error (RMSE): {rmse}')

# Predict future values
def predict_future_values(model, input_data):
    predictions = model.predict(input_data)
    return predictions

# Example of how to use the model to predict future taxi rides
future_weather_data = pd.DataFrame({
    'temperature': [15.0, 20.0],  # Example temperatures
    'precipitation': [0.0, 0.1],  # Example precipitation values
})

future_predictions = predict_future_values(model, future_weather_data)
print("Predicted number of rides for future weather data:", future_predictions)




         date  location_id  num_cabs  temperature  precipitation
0  2021-01-07          142      1502     2.592593            0.0
1  2021-01-07          142      1502     2.592593            0.0
2  2021-01-07          142      1502     2.592593            0.0
3  2021-01-07           68      1025     2.592593            0.0
4  2021-01-07           68      1025     2.592593            0.0
Root Mean Squared Error (RMSE): 851.160517936593
Predicted number of rides for future weather data: [372.65432477 401.03969599]


## Task 4: Bringing it all together

Now that we have the building blocks, write a function `predict_taxis(location_search_term: str, precipitation_in_mm: float, temperature_in_c: float)` with which a user can retrieve the predicted number of taxis close to a landmark, given a search string for the landmark and the weather forecast for a day!

Note about OpenSearch: In case you weren't able to upload the landmark documents to OpenSearch, we have created an index with everything uploaded for you that you can use instead. The name of the index is `ny_landmarks_teachers`.

In [46]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming df is already loaded as the DataFrame
def train_model(df):
    # Handle missing values if any
    df = df.dropna()  # Drop rows with missing values
    
    # Prepare feature columns and target variable
    X = df[['temperature', 'precipitation']]  # Features (weather conditions)
    y = df['num_cabs']  # Target variable (number of cabs)
    
    # Split data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Train the Linear Regression model
    model = LinearRegression()
    model.fit(X_train, y_train)
    
    # Evaluate the model using the test data
    y_pred = model.predict(X_test)
    
    # Calculate the Mean Squared Error and take the square root for RMSE
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    
    print(f'Root Mean Squared Error (RMSE): {rmse}')
    
    return model, X_train, X_test, y_train, y_test

# Train the model with the data
model, X_train, X_test, y_train, y_test = train_model(df)

# Define the function to predict taxi rides based on weather and location
def predict_taxis(location_search_term, precipitation_in_mm, temperature_in_c):
    # Placeholder: Instead of searching OpenSearch, we assume we are given location details.
    # In a real implementation, we'd query OpenSearch for landmarks and get location_id
    # Example mock data (replace with actual OpenSearch data in real-world scenario):
    
    location_id = 123  # This would be fetched based on the search term
    
    # Predict the number of taxis for the given weather data
    prediction_data = pd.DataFrame({
        'temperature': [temperature_in_c],
        'precipitation': [precipitation_in_mm]
    })
    
    prediction = model.predict(prediction_data)
    print(f'Predicted number of taxis near {location_search_term}: {prediction[0]}')
    return prediction[0]

# Example usage of the function
predict_taxis("Upper East Side", 10, 23.4)


Root Mean Squared Error (RMSE): 851.160517936593
Predicted number of taxis near Upper East Side: 411.0220343892516
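
The function above still uses a fixed `location_id`. One way to chain the three building blocks is sketched below, with the search, the zone lookup and the model passed in as callables so that the data flow can be tried without a server. This is an assumption about how the pieces could fit together: `search_landmarks` stands in for a function like `find_landmark`, `lookup_zone` for a zone lookup that returns `(location_id, zone)` or `None`, and `predict` for a model call. All values in the example are made up.

```python
def predict_taxis_for_landmark(search_term, precipitation_in_mm, temperature_in_c,
                               search_landmarks, lookup_zone, predict):
    """Chain landmark search, zone lookup and prediction; None if a step finds nothing."""
    landmarks = search_landmarks([search_term])
    if not landmarks:
        return None
    best = landmarks[0]  # take the best-matching landmark
    zone = lookup_zone(best["area"])  # the area is a WKT string
    if zone is None:
        return None
    location_id, zone_name = zone
    return {
        "landmark": best["name"],
        "location_id": location_id,
        "zone": zone_name,
        "predicted_taxis": predict(location_id, temperature_in_c, precipitation_in_mm),
    }

# Stand-in callables with made-up values, only to show the data flow
result = predict_taxis_for_landmark(
    "bank", 0.0, 15.0,
    search_landmarks=lambda terms: [{"name": "Some Bank", "area": "MULTIPOLYGON (((0 0, 1 0, 1 1, 0 0)))"}],
    lookup_zone=lambda wkt: (166, "Morningside Heights"),
    predict=lambda location_id, temp, precip: 400.0,
)
print(result)
```

Passing `location_id` to `predict` only makes a difference if the model uses the location as a feature, which the weather-only model above does not.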


## Feedback (voluntary)

How did you like this exercise? What could be improved?

Answer:

...

Further, I feel like:
 - [ ] the exercise was too easy
 - [ ] the exercise was too hard
 - [ ] the exercise was just right
 - [x] no answer
