This repository contains the code and documentation needed to run the project end to end.
The following software must be installed before running the project:
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
echo 'eval "$(/opt/homebrew/bin/brew shellenv)"' >> /Users/admin/.zprofile
eval "$(/opt/homebrew/bin/brew shellenv)"
brew install openjdk@11
brew install scala
brew install python
brew install apache-spark
Apache Spark download: https://spark.apache.org/downloads.html
Anaconda installation guide: https://docs.anaconda.com/anaconda/install/index.html
conda install pandas
pip install -U scikit-learn
To install PyTorch, follow the instructions at: https://pytorch.org/get-started/locally/
MySQL installer: https://dev.mysql.com/downloads/installer/
MySQL Connector/J 5.1.48 JAR: https://jar-download.com/artifacts/mysql/mysql-connector-java/5.1.48/source-code
Dataset (Kaggle): https://www.kaggle.com/datasets/giovamata/airlinedelaycauses
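Once the steps above are done, a quick check can confirm that the tools and packages are visible to Python. The following is a minimal sketch, not part of the repository: the executable names and the module list are assumptions based on the packages listed above (for example, `brew install openjdk@11` may need extra PATH setup before `java` is found, and `pyspark` and `matplotlib` are assumed because the project scripts use them).

```python
import importlib.util
import shutil

# Executables expected after the brew steps above (assumed names; adjust
# them if your installation uses different ones).
REQUIRED_TOOLS = ["java", "scala", "python3", "spark-submit"]

# Importable module names for the Python packages used by the project
# (assumed list, not taken from the repository).
REQUIRED_MODULES = ["pyspark", "pandas", "sklearn", "torch", "matplotlib"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the executables that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the top-level modules that Python cannot locate."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Missing tools:", missing_tools() or "none")
    print("Missing modules:", missing_modules() or "none")
```

An empty result from both functions suggests the environment is ready; anything listed points to the install step to revisit.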
.config("spark.driver.extraClassPath","C:/Users/AnshumaanChauhan/Documents/spark-3.3.0-bin-hadoop3/spark-3.3.0-bin-hadoop3/jars/mysql-connector-java-5.1.48.jar")
Change the path in this config attribute to the location of the MySQL connector JAR on your system.
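One way to avoid editing the script on every machine is to read the JAR location from an environment variable. This is a sketch under assumptions: the variable name `MYSQL_CONNECTOR_JAR`, the default relative path, and the application name `FlightDelays` are hypothetical and do not come from the repository.

```python
import os
from pathlib import Path


def connector_jar_path(env_var="MYSQL_CONNECTOR_JAR",
                       default="jars/mysql-connector-java-5.1.48.jar"):
    """Return the connector JAR path, preferring the environment variable.

    Both the variable name and the default path are illustrative assumptions.
    """
    return str(Path(os.environ.get(env_var, default)).expanduser())


def build_spark_session(app_name="FlightDelays"):
    """Create a SparkSession with the connector JAR on the driver classpath."""
    # Imported here so the path helper works even without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName(app_name)  # hypothetical application name
        .config("spark.driver.extraClassPath", connector_jar_path())
        .getOrCreate()
    )
```

With this in place, setting `MYSQL_CONNECTOR_JAR` in the shell before launching the script would replace the hardcoded path.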
dataset = spark.read.csv('C:\\Users\\AnshumaanChauhan\\Documents\\Systems for DS Umass\\Project\\archive (5)\\DelayedFlights.csv',
    header=True)
Change the path in the spark.read.csv call to the location where the dataset is stored on your system.
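Backslash escaping in Windows paths is easy to get wrong, so it may help to build the path with `pathlib` and confirm the file exists before handing it to Spark. A minimal sketch, where the helper name `dataset_path` and the example location are hypothetical:

```python
from pathlib import Path


def dataset_path(path):
    """Return the dataset path as a string, failing early if it is missing."""
    resolved = Path(path).expanduser()
    if not resolved.is_file():
        raise FileNotFoundError(f"Dataset not found: {resolved}")
    return str(resolved)


# Example usage (assumes an existing SparkSession named `spark` and a
# hypothetical download location):
# dataset = spark.read.csv(dataset_path("~/Downloads/DelayedFlights.csv"),
#                          header=True)
```

Checking up front gives a plain Python error that names the missing file, rather than a failure raised from inside Spark.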
updated_dataset.select(*(col(c) for c in dataset.columns)).write.format("jdbc") \
.option("url", "jdbc:mysql://localhost:3306/Sys") \
.option("driver", "com.mysql.jdbc.Driver").option("dbtable", "dataset") \
.option("user", "root").option("password", "MySQL").save()
updated_dataset = spark.read.format("jdbc") \
.option("url", "jdbc:mysql://localhost:3306/Sys") \
.option("driver", "com.mysql.jdbc.Driver").option("dbtable", "dataset") \
.option("user", "root").option("password", "MySQL").load()
In these statements, change the values of "user" and "password" to the credentials chosen when MySQL was set up on your system.
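To keep credentials out of the source file, the JDBC options could be collected in one place and read from the environment. This is a sketch, not the repository's code: the variable names `MYSQL_URL`, `MYSQL_USER`, and `MYSQL_PASSWORD` are assumptions, while the URL, driver class, and table name are the ones used in the statements above.

```python
import os


def jdbc_options(table="dataset"):
    """Build the JDBC options shared by the write and read statements.

    Environment variable names are illustrative assumptions.
    """
    return {
        "url": os.environ.get("MYSQL_URL", "jdbc:mysql://localhost:3306/Sys"),
        "driver": "com.mysql.jdbc.Driver",
        "dbtable": table,
        "user": os.environ.get("MYSQL_USER", "root"),
        # No default on purpose: a missing password raises KeyError early.
        "password": os.environ["MYSQL_PASSWORD"],
    }


# Example usage (assumes `spark` and `updated_dataset` already exist):
# updated_dataset.write.format("jdbc").options(**jdbc_options()).save()
# updated_dataset = spark.read.format("jdbc").options(**jdbc_options()).load()
```

Both the writer and the reader accept keyword options through `options(**...)`, so the same dictionary serves the save and the load.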
- MySQLQueries.sql : Contains the MySQL analysis queries
- SparkSQL_Queries_and_Python.py : Code for the PySpark analysis and the Matplotlib visualizations
- models_for_delay_prediction.py : Python file containing the machine learning component of the project
- Scalability Check for Machine Learning System Predciting Flight Delays - Final Report : Project report created in MLSys 2022 format