
Alura Data Science Challenge II

This repository contains my project developed during the second Data Science Challenge promoted by Alura, a Brazilian online platform for technology courses.

The purpose of these challenges is to deepen students' knowledge of Data Science through practical problems. In this case, we use PySpark for the entire ETL process, build a regression model to price real estate, and create a real estate recommender.

| 🪧 Vitrine.Dev | |
| --- | --- |
| ✨ Project Name | Alura Data Science Challenge II |
| 🏷️ Technologies | Python |
| 🚀 Libraries | PySpark, zipfile, Seaborn and pyplot |
| 🔥 Challenge | https://www.alura.com.br/challenges/data-science-2/ |

About the Challenge

Objectives: Build a regression model to price real estate and create a real estate recommender.

Data: The data is available here and the data dictionary here.

Structure: The challenge is divided into 3 parts: ETL (Extract, Transform and Load), Creating Regression Models, and Creating a Real Estate Recommender.


1 - Extract, Transform and Load

The first part of the project is dedicated to the ETL process: extracting the data in JSON format into Python, transforming/cleaning it, and loading it in CSV and Parquet file formats. All of these activities were performed using the PySpark library.

At this stage, the data was also translated into English.

Generated files: CSV file and Parquet file.

All activities performed are documented in this notebook.
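
A minimal sketch of what this stage might look like; the file paths and column names below are assumptions for illustration, not the notebook's actual schema:

```python
import zipfile

from pyspark.sql import SparkSession

# Unpack the zipped raw data (hypothetical path); the project lists the
# zipfile library alongside PySpark
with zipfile.ZipFile("data/raw/data.zip") as zf:
    zf.extractall("data/raw")

spark = SparkSession.builder.appName("real-estate-etl").getOrCreate()

# Extract: read the raw JSON listings into a Spark DataFrame
df = spark.read.json("data/raw/real_estate.json")

# Transform: illustrative cleaning steps -- translate a Portuguese column
# name to English and drop rows without an identifier
df = (
    df.withColumnRenamed("preco", "price")
      .dropna(subset=["id"])
)

# Load: persist the cleaned data in the two formats produced by the notebook
df.write.mode("overwrite").option("header", True).csv("data/processed/real_estate_csv")
df.write.mode("overwrite").parquet("data/processed/real_estate_parquet")
```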


2 - Regression Models

The second part of the project is dedicated to:

  • Processing the data from the first notebook, "1 - Extract, Transform and Load", for use with regression models:
    • Treating null and NaN data;
    • Treating missing data in the zone columns;
    • Transforming categorical columns into binary columns (0, 1);
    • Removing useless columns;
    • Saving the DataFrame in a Parquet file.
  • Creating models (see the sketch after this list):
    • Vectorizing the data (VectorAssembler);
    • Creating 4 models (Linear Regression, Decision Tree Regressor, Random Forest Regressor and Gradient-Boosted Tree Regressor).
  • Optimizing the best model:
    • Cross-validation and hyperparameter tuning (see the tuning sketch at the end of this section).
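
A condensed sketch of this pipeline, assuming hypothetical paths and feature columns ("usage_type", "area", "bedrooms", "bathrooms") rather than the notebook's actual schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import (
    LinearRegression,
    DecisionTreeRegressor,
    RandomForestRegressor,
    GBTRegressor,
)
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("real-estate-models").getOrCreate()

# Load the Parquet file produced by the first notebook (hypothetical path)
df = spark.read.parquet("data/processed/real_estate_parquet")

# Treat nulls and encode a categorical column as a binary 0/1 column
# (column names here are illustrative)
df = df.dropna(subset=["price"])
df = df.withColumn(
    "is_residential",
    F.when(F.col("usage_type") == "Residential", 1).otherwise(0),
)

# Vectorize the features into the single column Spark ML expects
assembler = VectorAssembler(
    inputCols=["area", "bedrooms", "bathrooms", "is_residential"],
    outputCol="features",
)
data = assembler.transform(df).select("features", "price")
train, test = data.randomSplit([0.8, 0.2], seed=42)

# Fit the four model families and compare them on test RMSE
evaluator = RegressionEvaluator(labelCol="price", metricName="rmse")
for name, estimator in {
    "Linear Regression": LinearRegression(labelCol="price"),
    "Decision Tree": DecisionTreeRegressor(labelCol="price"),
    "Random Forest": RandomForestRegressor(labelCol="price"),
    "Gradient-Boosted Trees": GBTRegressor(labelCol="price"),
}.items():
    model = estimator.fit(train)
    rmse = evaluator.evaluate(model.transform(test))
    print(f"{name}: RMSE = {rmse:,.0f}")
```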

Parquet file generated.

All activities performed are documented in this notebook.
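
For the optimization step, a hedged sketch of cross-validation with a hyperparameter grid, assuming for illustration that the Random Forest was the best model and reusing the `train` split from the previous sketch; the grid values are assumptions:

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator

# Hyperparameter grid for the candidate model (grid values are illustrative)
rf = RandomForestRegressor(labelCol="price", featuresCol="features")
grid = (
    ParamGridBuilder()
    .addGrid(rf.numTrees, [20, 50, 100])
    .addGrid(rf.maxDepth, [5, 10])
    .build()
)

# 5-fold cross-validation over the grid, scored by RMSE
cv = CrossValidator(
    estimator=rf,
    estimatorParamMaps=grid,
    evaluator=RegressionEvaluator(labelCol="price", metricName="rmse"),
    numFolds=5,
    seed=42,
)
cv_model = cv.fit(train)  # "train" comes from the previous sketch
print(f"best cross-validated RMSE: {min(cv_model.avgMetrics):,.0f}")
```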

Week 3

Work in progress...

InsightPlaces
