# NLP-Based GitHub Language Predictor Project

- Predict the primary programming language of a GitHub repository from the text of its README.


# Goal

- Identify words and n-grams in README text that distinguish programming languages.
- Use them to develop a machine learning model to predict a repository's language.

# Imports

In [1]:
# local imports
# import acquire as a
# import prepare as p
# import explore as e
# import model as m

# Acquire

- Data was acquired from GitHub's API.
- Data was initially acquired on 19 Oct, 2023.
- It contained 677 rows and 3 columns before cleaning.
- Each row represents a unique repository, drawn from robotics projects starred more than 200 times.
- The columns (repos, readme) hold the label and text associated with each repo.
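
The acquisition criteria above can be expressed as a GitHub search query. The sketch below only builds the request URL and makes no network call. The function name `build_search_url`, the keyword-based query, and the sort order are assumptions for illustration; the project's actual `acquire` module may work differently.

```python
from urllib.parse import urlencode

# GitHub REST API endpoint for repository search.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keyword="robotics", min_stars=200, page=1, per_page=100):
    """Build a search URL for repos matching a keyword with more than min_stars stars.

    Hypothetical helper: the keyword and paging defaults are assumptions.
    """
    params = {
        "q": f"{keyword} stars:>{min_stars}",  # search qualifier for star count
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,  # GitHub caps this at 100 per page
        "page": page,
    }
    return f"{SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url(page=1))
```

Each result page would then be requested with an HTTP client, and the README of each repository fetched separately.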

# Prepare

## Data Preprocessing Steps

### **Replacing Null/NaN Values:**
   - No null values were replaced.

### **Column Renaming:**
   - Renamed the `readme_contents` column to a shorter name.

### **Column Removal:**
   - Removed columns that did not contain useful information:
     - Columns with repo names
         - Repo names were used to retrieve each repo's language and text data, so they were no longer needed after retrieval.

### **Row Handling:**
   - All rows were cleaned: HTML tags, links, and other characters that were not useful were removed.
   - Reduced the dataset from 677 to 609 observations.
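
A minimal sketch of this kind of cleaning, using only the standard library. The function names and regular expressions are assumptions, not the project's `prepare` module, and dropping rows that end up empty is one guess at why the row count fell.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")                 # HTML tags such as <h1> or <img ...>
URL_RE = re.compile(r"https?://\S+|www\.\S+")   # bare links
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")       # anything but lowercase letters, digits, whitespace


def clean_readme(text: str) -> str:
    """Strip HTML tags, links, and punctuation, then normalize whitespace."""
    text = html.unescape(text)       # turn entities like &lt; back into characters
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = text.lower()
    text = NON_ALNUM_RE.sub(" ", text)
    return " ".join(text.split())


def drop_empty(readmes):
    """Clean every README and keep only those with text left (assumed rule)."""
    cleaned = [clean_readme(r) for r in readmes]
    return [c for c in cleaned if c]


if __name__ == "__main__":
    print(clean_readme("<h1>ROS Robot</h1> See https://example.com/docs for more!"))
```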

### **Data Type Validation:**
   - Checked and ensured that column data types were appropriate.

### **Feature Engineering:**
   - Added bigrams and trigrams as new features.
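
As an illustration of how bigram and trigram features might be derived from cleaned text, here is a small sketch. The helper names `ngrams` and `top_ngrams` are hypothetical, and it assumes the text is already cleaned and whitespace-separated.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(texts, n, k=10):
    """Count n-grams across all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(ngrams(text.split(), n))
    return counts.most_common(k)


if __name__ == "__main__":
    docs = ["robot operating system", "robot operating"]
    print(top_ngrams(docs, n=2, k=3))
```

Computing the top n-grams on the training split only avoids leaking information from the validation and test sets.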

### **Data Splitting:**
   - Split the data into three sets for modeling: training, validation, and test, with an approximate split of 70/15/15.
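
A sketch of an approximate 70/15/15 split using only the standard library. The function name and seed are assumptions; in practice a library routine that also stratifies on the target language may be preferable.

```python
import random


def train_val_test_split(rows, train_frac=0.70, val_frac=0.15, seed=42):
    """Shuffle rows reproducibly and cut them into train, validation, and test."""
    rows = list(rows)                    # copy so the caller's data is not reordered
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]        # remainder, roughly 15%
    return train, val, test


if __name__ == "__main__":
    train, val, test = train_val_test_split(range(609))
    print(len(train), len(val), len(test))
```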

In [2]:
# acquire, clean, and prepare the data
# df = wrangle_zillow()

# split into train, validate, and test datasets
# train, val, test = train_val_test(df)

### Quick view of data

In [3]:
# First 5 rows of data
# train.head()

### Distribution of Data

In [4]:
# function for data distribution based on target
# dist_of_data(df, 'home_value')

# Explore

* ?
* ?
* ?  
* ?  

## Question 1. - **?**

In [5]:
# analysis 1 visualization
# analysis_1(train)

### **Takeaway:**  
### Bedroom count shows a relationship with home value that appears linear. This will be a good feature for predicting Single Family Residential property value in the modeling phase.

## Question 2. - **?**

In [6]:
# analysis 2 visualization
# analysis_2(train)

### **Takeaway:**  
### Home value and year built (by decade) show a linear relationship. Year built will be a good feature for predicting Single Family Residential property value in the modeling phase.

## Question 3. - **?**

- **Null Hypothesis ($H_0$):** There is **no significant correlation** between 

- **Alternative Hypothesis ($H_a$):** There is **a significant correlation** between 

In [7]:
# Analysis 3 visualization
# analysis_3(train)

### **Takeaway:** 
### The statistical test rejected the null hypothesis, so we conclude that there is a strong linear correlation between property area and home value.

## Question 4. - **?**

- **Null Hypothesis ($H_0$):** There is **no significant correlation** between the 

- **Alternative Hypothesis ($H_a$):** There is **a significant correlation** between 

In [8]:
# analysis 4 visualization
# analysis_4(train)

### **Takeaway:** 
### The statistical test failed to reject the null hypothesis. Lot area and home value show a positive correlation, but lot area may need to be feature engineered before it is used as a feature.

# Explore Summary

- 

# Features for modeling

### **Features that will be used:**
* *Language*(target)  

### **Some features that may be useful:**  
* *text*
* *top_bigrams*
* *top_trigrams*


# Modeling

### Evaluation Metric
- The primary evaluation metrics for the models are:
  - Accuracy, precision, and recall
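
For reference, these metrics can be computed directly from true and predicted labels. The sketch below uses plain Python with made-up language labels; the function names are illustrative, and precision and recall are computed per class.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, treating it as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # true positives
    fp = sum(t != label and p == label for t, p in pairs)  # false positives
    fn = sum(t == label and p != label for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative labels only, not project data.
    y_true = ["Python", "C++", "Python", "C++"]
    y_pred = ["Python", "Python", "Python", "C++"]
    print(accuracy(y_true, y_pred))
    print(precision_recall(y_true, y_pred, "Python"))
```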

### Data Expansion
- Given more time, additional data could be gathered to improve model performance.
- Experimenting with hyperparameter configurations could improve accuracy, precision, and recall.

### Target Feature
- The target feature for the models is `Language`.

### Baseline accuracy
- The baseline is evaluated using accuracy and F1 score.
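
One common baseline for classification is to always predict the most frequent class. The sketch below assumes that definition, which the project does not confirm, and uses illustrative labels rather than project data.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)


if __name__ == "__main__":
    labels = ["Python", "Python", "Python", "C++", "C++"]
    print(baseline_accuracy(labels))
```

A model is only useful if it beats this figure on the validation set.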

### Models and Hyperparameters
- Three models will be developed:
  1. LogisticRegression (default)
  2. RandomForestClassifier (default)
  3. XGBClassifier ()

### Model Evaluation
- Models will be developed and evaluated on the training and validation datasets.

### Final Model Selection
- The best-performing model, as determined by the evaluation metrics above on the validation set, will be selected.

### Test Dataset
- The selected best model will be evaluated on the test dataset for final performance assessment.


## Baseline Model Root Mean Squared Error

In [9]:
# baseline and model rmse and residual plot visual
# wrangle_zillow_and_train_model()

## Best of 3 model configurations

### SelectKBest used for feature selection

In [10]:
# selected_features = select_k_features(train)
# selected_features.to_list()

### The selected features were used in model evaluation but did not offer promising results.

In [11]:
# X_train, y_train, X_val, y_val, X_test, y_test = model_xy()

## Model 1

In [12]:
### RandomForestRegression Model - The RandomForestRegression model showed signs of overfitting, as it performed well on the training data but had a large margin of error.
# model1_results = model_1(X_train, y_train, X_val, y_val)

## Model 2

In [13]:
### XGBRegressor Model - hyperparameter tuning (combination of early round stopping and eval set)
# model2_results = model_2(train, 'home_value', X_val, y_val, params=None)

## Model 3

In [14]:
### Polynomial Features LinearRegression Model - The Polynomial Features LinearRegression model exhibited better generalization, but it had poor R2 scores.
# model3_results = model_3()

## Best on Test

#### XGBoost `XGBRegressor` with a combination of early stopping rounds and an evaluation set.

In [15]:
# final assessment on test data using XGBRegressor model
# final_results = final_model(train, 'home_value', X_test, y_test, params=None)

# Model Summary

### RandomForestRegression Model
- The RandomForestRegression model showed signs of overfitting, as it performed well on the training data but had a large margin of error.
  
### Polynomial Features LinearRegression Model
- The Polynomial Features LinearRegression model exhibited better generalization, but it had poor R2 scores.

### XGBRegressor Model
- The XGBRegressor model, with hyperparameter tuning, performed the best with the following metrics:

  - RMSE: $325,000
  - R2 Score: 0.81

# Conclusion

## Takeaways and Key Findings

- Home details such as the calculated area of the home, lot area, number of bedrooms, number of bathrooms, and year built are significant drivers of home value.
- Calculated area is the most important feature across all factors.
- Higher-priced and larger Single Family Residential properties appear to be concentrated in Ventura County, while smaller homes are spread across Orange and Los Angeles counties.
- Larger living areas correlate with higher property values.
- Location plays a crucial role in property value.


## Model Improvement
- The model still requires further improvement.

# Recommendations and Next Steps

- If the data included details about attached appliances or amenities that homebuyers seek in Single Family Residential properties, these might help explain property value.

- Given more time, the following actions could be considered:
  - Gather more data to improve model performance.
  - Feature engineer new variables to enhance model understanding.
      - Transaction dates, to capture value over time
  - Fine-tune model parameters for better performance.