Karla Jacobo  
DSCI 8950  
ML Applications to Project

To start off, I am going to import the datasets that were a result of the individual EDA assignment from earlier in the semester. I will be doing some additional transforming of the data to make it more suitable for the Machine Learning models in this assignment. 

As we can see below, I am bringing data together from the Airport and Routing tables. This will give us a better idea of the points where the routes are going to and coming from. In order to keep the scope within reason, the data for this assignment is limited to the United States. Limiting the data to within the United States was applied to three tables: routes, airports, and planes.

In [None]:
from itertools import count
import pandas as pd
import plotly.graph_objects as go
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from scipy.optimize import linprog
import numpy as np

airline = pd.read_csv('../../data/openFlightsRaw/airline.csv')
airports = pd.read_csv('../../data/openFlightsRaw/airports.csv')
countries = pd.read_csv('../../data/openFlightsRaw/countries.csv')
planes = pd.read_csv('../../data/openFlightsRaw/planes.csv')
routes = pd.read_csv('../../data/openFlightsRaw/routes.csv')
cleanRoutes = pd.read_csv('../../data/usRoutesClean.csv')

usAirports = airports.where(airports["CTRY"] == "United States")
usAirports = usAirports.drop_duplicates(subset=["IATA"])

usRoutes = routes.where(routes["SRC_AIRPT"].isin(usAirports["IATA"]) & routes["DESTN_AIRPT"].isin(usAirports["IATA"]))

usRoutes["SRC_NAME"] = (pd.merge(usRoutes, usAirports, left_on="SRC_AIRPT", right_on="IATA"))["NAME"]
usRoutes["SRC_LAT"] = (pd.merge(usRoutes, usAirports, left_on="SRC_AIRPT", right_on="IATA"))["LAT"]
usRoutes["SRC_LONG"] = (pd.merge(usRoutes, usAirports, left_on="SRC_AIRPT", right_on="IATA"))["LONG"]
usRoutes["DESTN_NAME"] = (pd.merge(usRoutes, usAirports, left_on="DESTN_AIRPT", right_on="IATA"))["NAME"]
usRoutes["DESTN_LAT"] = (pd.merge(usRoutes, usAirports, left_on="DESTN_AIRPT", right_on="IATA"))["LAT"]
usRoutes["DESTN_LONG"] = (pd.merge(usRoutes, usAirports, left_on="DESTN_AIRPT", right_on="IATA"))["LONG"]
usRoutes["EQPT"] = usRoutes["EQPT"].str.split(" ")
usRoutes["EQPT_COUNT"] =  usRoutes["EQPT"].str.len()

usRoutes = usRoutes.dropna()
usAirports = usAirports.dropna()
usRoutes.to_csv("test")

usPlanes = planes.where(planes["IATA"].isin(usAirports["IATA"]) & (planes["IATA"] != '\\N'))
usPlanes = usPlanes.dropna()

#### **Machine Learning Model #1: Decision Tree Classification and Regression**  

For the first model, I will be using a Decision tree model classifier. This model will take in the latitude and longitude data for the routes as our dependent variables. I will use the number vehicle possibilities for our independent variable. I chose to use this variable because if a route is able to use more than one type of vehicle, that makes it more flexible. This would make it easier to find an alternative for it if one vehicle is not available. The source and destination latitude and longitudes were factored in because I wanted to see if there is a correlation between the starting and end points and the vehicles available to them.

Because the data was already cleaned and organized into corresponding tables, we can move right into the train_test_split() for the decision tree. For the testing size of the data, I will start with a size of 20%. The model itself is fairly simple to set up. I will start without any parameters in the DecisionTreeClassifier();

In [23]:
columns = ["SRC_LAT","SRC_LAT", "DESTN_LAT", "DESTN_LONG"]
dependentVariables = usRoutes[columns]
independentVariables = usRoutes["EQPT_COUNT"]

X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
classifier = DecisionTreeClassifier()
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Accuracy score (no parameters):",metrics.accuracy_score(y_test, y_pred))

Accuracy score (no parameters): 0.5978378378378378


Without tuning any parameters besides the size of the test data, the model resulted in an accuracy score of about 0.6 (1 is a perfect score). To see if a higher score can be acheived, I will tune the following parameters: criterion, splitter, and min_samples_leaf. 

In [24]:
classifier = DecisionTreeClassifier(criterion="gini")
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Accuracy score (criterion: gini):",metrics.accuracy_score(y_test, y_pred))

classifier = DecisionTreeClassifier(criterion="entropy")
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Accuracy score (criterion: entropy):",metrics.accuracy_score(y_test, y_pred))

classifier = DecisionTreeClassifier(criterion="log_loss")
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Accuracy score (criterion: log_loss):",metrics.accuracy_score(y_test, y_pred))

classifier = DecisionTreeClassifier(splitter="random")
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Accuracy score (criterion: gini, splitter: random):",metrics.accuracy_score(y_test, y_pred))

classifier = DecisionTreeClassifier(min_samples_leaf = 5)
classifier = classifier.fit(X_train,y_train)
y_pred = classifier.predict(X_test)

print("Accuracy score (criterion: gini, min_samples_leaf: 5):",metrics.root_mean_squared_error)

Accuracy score (criterion: gini): 0.5935135135135136
Accuracy score (criterion: entropy): 0.5913513513513513
Accuracy score (criterion: log_loss): 0.5924324324324325
Accuracy score (criterion: gini, splitter: random): 0.6086486486486486
Accuracy score (criterion: gini, min_samples_leaf: 5): <function root_mean_squared_error at 0x000001FFAA40E7A0>


Through the tuning of the parameters, the one that made a significant change was the minimum sample leaf number. The default to that variable is 2 and it indicated the "minimum number of samples required to split an internal node" (scikit-learn). The other two variables that were tuned did not make a significant change. Because of that, I stuck with the default parameters for criterion and splitter, which are "gini" and "best".  

I will be applying a Decision Tree regression to the same dataset. By looking at the accuracy score for both, we can potentially see which model is better for the problem at hand.

In [25]:
X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor()
regression = regression.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print("Accuracy score:",metrics.accuracy_score(y_test, y_pred))

Accuracy score: 0.6972972972972973


Without supplying any parameters, the model is giving an accuracy score of about 0.84 (1 is a perfect score). I will tune the same parameters from the classification model to see how that will affect the accuracy score.

In [26]:
X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor(criterion="friedman_mse")
regression = regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
print("Accuracy score (criterion: friedman_mse):",)

X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor(criterion="absolute_error")
regression = regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
print("Accuracy score (criterion: absolute_error):", )

X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor(criterion="poisson")
regression = regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
print("Accuracy score (criterion: poisson):", )

X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor(criterion="poisson")
regression = regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
print("Accuracy score (criterion: poisson):", )

X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor(splitter="random")
regression = regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
print("Accuracy score (splitter: random):", metrics.accuracy_score(y_test, y_pred))

X_train, X_test, y_train, y_test = train_test_split(dependentVariables, independentVariables, test_size=0.2)
regression = DecisionTreeRegressor(splitter="random")
regression = regression.fit(X_train, y_train)
y_pred = regression.predict(X_test)
print("Accuracy score (min_samples_leaf: 5):", metrics.mean_squared_error(y_test, y_pred))

Accuracy score (criterion: friedman_mse):
Accuracy score (criterion: absolute_error):
Accuracy score (criterion: poisson):
Accuracy score (criterion: poisson):


ValueError: Classification metrics can't handle a mix of multiclass and continuous targets

After tuning the same parameters as the classification decision tree model, this model still seems to have a higher accuracy score. The parameter that has the largest impact on the scores was the criterion parameter. In both model types, the criterion parameter is a, "function to measure the quality of a split" (scikit-learn). The splitter and min_samples_leaf parameters did not change the results significantly.

The model being used should be based more on the problem at hand rather than the accuracy score given. Because of that, I would most likely choose to use the Regression decision tree model to show that direct relationship between variables.

#### **Machine Learning Model #3: Isolation Forest - Finding Anomalies in Altitude**

Now that the data has been further refined to routes and airports in the United States, we can apply some anomaly detection to some of the locations. More specifically, I intend to use it on the altitude for airports in the dataset that we have. This will help find outliers in airports that are at a particularly high or low altitude. To so so, I plan to use the Isolation Forest model.

I am going to start of the Isolation forst model without any hyper parameters. Below we can see that in doing that, the model identified 1493 outliers out of the 7698 rows it was given.

In [None]:
altitudeData = airports[["ALT"]]
isoForestAnalysis = altitudeData.copy()

isolation_forest = IsolationForest()
isolation_forest.fit(altitudeData)
isoForestAnalysis['decisionFunction'] = isolation_forest.decision_function(altitudeData)
isoForestAnalysis['predictFunction'] = isolation_forest.predict(altitudeData)
isoForestAnalysis['scoreSamples'] = isolation_forest.score_samples(altitudeData)

outlierCount = isoForestAnalysis['predictFunction'].where(isoForestAnalysis['predictFunction'] == -1).value_counts()

print(f"Number of outliers detected: {outlierCount.values} out of {isoForestAnalysis.shape[0]} entries \n")
print(isoForestAnalysis.head())

Number of outliers detected: [1730] out of 7698 entries 

    ALT  decisionFunction  predictFunction  scoreSamples
0  5282         -0.129398               -1     -0.629398
1    20          0.097575                1     -0.402425
2  5388         -0.126663               -1     -0.626663
3   239          0.060153                1     -0.439847
4   146          0.063746                1     -0.436254


One thing to notice is that this first iteration of the model is not using default amount of sample data. In the Isolation Forest model, the size of the sample data is defined as a parameter given to the method. In the next iteration, I will be adding the max_samples parameter to specify the sample size for training. In addition to that parameter, I will slowly be adding other parameters in like the number of estimators, contamination, and random state to see how it will affect the output of the model. Most importantly though, I will need to find the balance between the parameters in order not to overfit the model.

In [None]:
isolation_forest = IsolationForest(n_estimators=300, max_samples=0.2)
isolation_forest.fit(altitudeData)
isoForestAnalysis['decisionFunction'] = isolation_forest.decision_function(altitudeData)
isoForestAnalysis['predictFunction'] = isolation_forest.predict(altitudeData)
isoForestAnalysis['scoreSamples'] = isolation_forest.score_samples(altitudeData)

outlierCount = isoForestAnalysis['predictFunction'].where(isoForestAnalysis['predictFunction'] == -1).value_counts()

print(f"Number of outliers detected: {outlierCount.values} out of {isoForestAnalysis.shape[0]} entries \n")
print(isoForestAnalysis.head())

Number of outliers detected: [1309] out of 7698 entries 

    ALT  decisionFunction  predictFunction  scoreSamples
0  5282         -0.061941               -1     -0.561941
1    20          0.106139                1     -0.393861
2  5388         -0.067257               -1     -0.567257
3   239          0.058377                1     -0.441623
4   146          0.064798                1     -0.435202


In the block of code above, we can see that by adding the number of estimators and setting it to 300 (the default number being 100) and the amount of sample data to 20% of 7698 rows of data, the number of outliers that were detected went down to 1404 instances. This is likely because in the original instance, there was no sample size specified. When that happens, the Isolation Tree model defaults to 256 rows of data. By giving it 20% of our dataset (about 1540 rows), we have given it more data to train on. In doing so, the model likely picked up less false positives this time around.

In the last set of parameter tuning, I will be adding the contamination and random_state parameters to the model.

In [None]:
isolation_forest = IsolationForest(n_estimators=300, max_samples=0.2, contamination=0.1, random_state=42)
isolation_forest.fit(altitudeData)
isoForestAnalysis['decisionFunction'] = isolation_forest.decision_function(altitudeData)
isoForestAnalysis['predictFunction'] = isolation_forest.predict(altitudeData)
isoForestAnalysis['scoreSamples'] = isolation_forest.score_samples(altitudeData)

outlierCount = isoForestAnalysis['predictFunction'].where(isoForestAnalysis['predictFunction'] == -1).value_counts()

print(f"Number of outliers detected: {outlierCount.values} out of {isoForestAnalysis.shape[0]} entries \n")
print(isoForestAnalysis.head())

Number of outliers detected: [769] out of 7698 entries 

    ALT  decisionFunction  predictFunction  scoreSamples
0  5282         -0.033310               -1     -0.567873
1    20          0.137939                1     -0.396624
2  5388         -0.041863               -1     -0.576426
3   239          0.093762                1     -0.440801
4   146          0.104737                1     -0.429826


In adding the contamination and random_state parameters, we can see that the number of outliers detected was significantly reduced. This has nothing to do with the random state parameter. That parameter just keeps the ouput from changing as much when the code is run. The number of outliers was reduced because the contamination level sets the proportion for expected outliers in the dataset. By setting it to 10% we are significantly restricting the count of outliers. I will be removing the random_state just in case that affects the content of the sample being given to the model. This way, the model will be given varying sample data throughout its iterations.

In the block below, I am going to run the model once more with my prefered paramaters. In addition to that, I will supply some visuals to support the analysis.

In [None]:
isolation_forest = IsolationForest(n_estimators=300, max_samples=0.2,contamination=0.02)
isolation_forest.fit(altitudeData)
isoForestAnalysis['decisionFunction'] = isolation_forest.decision_function(altitudeData)
isoForestAnalysis['predictFunction'] = isolation_forest.predict(altitudeData)
isoForestAnalysis['scoreSamples'] = isolation_forest.score_samples(altitudeData)

outlierCount = isoForestAnalysis['predictFunction'].where(isoForestAnalysis['predictFunction'] == -1).value_counts()

print(f"Number of outliers detected: {outlierCount.values} out of {isoForestAnalysis.shape[0]} entries \n")
print(isoForestAnalysis.head())

print('------------------------------------------------------------------------ \n')


airports['predictFunction'] = isoForestAnalysis['predictFunction']
outliers = airports.where(airports['predictFunction'] == -1)
outliers = outliers.dropna()

fig = go.Figure(data=go.Scattergeo(
        locationmode = 'USA-states',
        lon = outliers['LONG'],
        lat = outliers['LAT'],
        text = outliers['NAME'],
        mode = 'markers',
        marker_color = outliers['ALT'],
        marker = dict(
            autocolorscale = False,
            colorbar = dict(
                title=dict(
                    text="Airport Altitude"
                )
            ))))

fig.update_layout(
        title = 'Airports with Altitude Outliers)',
        geo_scope='usa',
    )

fig.show()

Number of outliers detected: [154] out of 7698 entries 

    ALT  decisionFunction  predictFunction  scoreSamples
0  5282          0.043379                1     -0.570297
1    20          0.221475                1     -0.392202
2  5388          0.037944                1     -0.575733
3   239          0.173953                1     -0.439724
4   146          0.174076                1     -0.439601
------------------------------------------------------------------------ 



In this chart, we can see 2% of the detected outliers plotted on the map. I limited it to 2.5% in order to highlight some of the outliers like Telluride airport and others in the Rocky Mountains area. Those are on the upper-bound for Altitude. The ones in Southern California represent the lower bound.

#### Using SciPy: Linear Programming

Basic terminologies of Linear Programming

Objective Function: The main aim of the problem, either to maximize of to minimize, is the objective function of linear programming. In the problem shown below, Z (to minimize) is the objective function.  
  
Decision Variables: The variables used to decide the output as decision variables. They are the unknowns of the mathematical programming model. In the below problem, we are to determine the value of x and y in order to minimize Z. Here, x and y are the decision variables.  
  
Constraints: These are the restrictions on the decision variables. The limitations on the decision variables given under subject to the constraints in the below problem are the constraints of the Linear programming.  
  
Non – negativity restrictions: In linear programming, the values for decision variables are always greater than or equal to 0.

(https://www.geeksforgeeks.org/python-linear-programming-in-pulp/)

# Sample code pulled from internet below

In [64]:
distDict = [
    [0, 500, 75, 50],
    [500, 0, 280, 800],
    [75, 280, 0, 30],
    [50, 800, 30, 0]
    ]

c = [0, 500, 75, 50]
# Define inequality constraints (A_ub @ x <= b_ub)

# Define bounds for x and y (0 <= x <= None, 0 <= y <= None)
bounds = [0, 800]

# Solve the linear programming problem
result = linprog(c,bounds=bounds)

print(result)


       message: Optimization terminated successfully. (HiGHS Status 7: Optimal)
       success: True
        status: 0
           fun: 0.0
             x: [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
           nit: 0
         lower:  residual: [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
                marginals: [ 0.000e+00  5.000e+02  7.500e+01  5.000e+01]
         upper:  residual: [ 8.000e+02  8.000e+02  8.000e+02  8.000e+02]
                marginals: [ 0.000e+00  0.000e+00  0.000e+00  0.000e+00]
         eqlin:  residual: []
                marginals: []
       ineqlin:  residual: []
                marginals: []


### Sources Cited

**Decision Tree**  
https://www.datacamp.com/tutorial/decision-tree-classification-python  
https://scikit-learn.org/stable/modules/tree.html#classification  
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html  
https://www.geeksforgeeks.org/python-decision-tree-regression-using-sklearn/  
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html  

**Isolation Forest**  
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html  
https://plotly.com/python/scatter-plots-on-maps/  
https://plotly.com/python/scatter-plots-on-maps/  
