some format changes
zhou1489 committed Jun 20, 2024
1 parent fdd7b95 commit 4175c5f
Showing 4 changed files with 83 additions and 61 deletions.
In this project, we will understand how to identify and handle outliers.

== Questions

=== Question 1 (2 points)

Outliers are values that are significantly different from other values, which can cause inaccurate results for data analysis.

[source,python]
----
from scipy import stats
import numpy as np

# Compute z-scores for the column (column name 'Value' assumed, matching later examples)
z_scores = stats.zscore(my_df['Value'])
abs_z_scores = np.abs(z_scores)
outliers = my_df[abs_z_scores > 3]
print("Outliers detected:\n", outliers)
----

.. Please use your own words to explain the example code above.
.. Please identify outliers in the 'SalePrice' column using `Z-score`.


=== Question 2 (2 points)

The following example code uses the Interquartile Range (IQR) method:

[source,python]
----
# Quartiles and interquartile range
Q1 = my_df['Value'].quantile(0.25)
Q3 = my_df['Value'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Clip values outside the bounds back to the bounds
my_df['Value'] = my_df['Value'].clip(lower_bound, upper_bound)
----

.. Please replace outliers in the 'SalePrice' column of the train.csv file using the IQR method.
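As a self-contained illustration of what clipping to the IQR bounds does, here is a sketch on made-up numbers (not the project data):

[source,python]
----
import pandas as pd

# Toy data: one obvious outlier (100)
s = pd.Series([1, 2, 3, 4, 5, 100])

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
clipped = s.clip(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)

# The extreme value is pulled down to the upper bound
print(clipped.max())
----

Note that clipping keeps every row; the outlier is replaced by the boundary value rather than dropped.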



=== Question 3 (2 points)

The following code is used to find out how many missing values are present:

[source,python]
----
missing_v = my_df_2.isnull().sum()
print("Missing values in each column:\n", missing_v)
----

.. Please identify how many columns have missing values in the train.csv file.


=== Question 4 (2 points)

Imputation means replacing missing data with specified values. For example, the following code fills missing entries with the mean:

[source,python]
----
my_df_2['Value'].fillna(my_df_2['Value'].mean(), inplace=True)
----

.. Please replace missing values in the 'LotFrontage' column in train.csv using the `mean` value.

=== Question 5 (2 points)

Missing data can significantly impact the results of model training.

.. Explain the importance of handling missing values in a dataset. Provide an example to support your statement.
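To make this concrete, here is a small sketch on made-up numbers (not train.csv) showing how mean imputation keeps all rows without shifting the column mean:

[source,python]
----
import numpy as np
import pandas as pd

df = pd.DataFrame({'Value': [10.0, 12.0, np.nan, 11.0, np.nan, 13.0]})

# Only 4 of the 6 rows have a value; many models cannot use the NaN rows at all
print(df['Value'].count())

# Mean imputation fills the gaps while leaving the column mean unchanged
filled = df['Value'].fillna(df['Value'].mean())
print(filled.mean())
----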


Classes, also called labels or categories, are outputs based on input features.

=== Question 2 (2 points)

Next, let us pre-process the data and split the dataset into training and testing sets.

The training set is the data used to train the model, and the testing set is used to evaluate the model's performance after training.

[source,python]
----
from sklearn.model_selection import train_test_split

# Split into 80% training and 20% testing data (random_state value assumed)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
----

.. Run the example code, and explain what the lines below are doing, along with what the values of X and y will be

[source,python]
----
from sklearn.preprocessing import StandardScaler

# Standardize features: fit on the training set, then apply the same scaling to the test set
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
----

.. Please state why we need to standardize data in general and note the differences between the original data and the standardized data using the code provided above.

[TIP]
====
Please read https://medium.com/analytics-vidhya/why-scaling-is-important-in-machine-learning-aee5781d161a[this article] to learn about scaling, standardization, and normalization.
====
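As a quick sanity check of what standardization does, a standardized column ends up with mean 0 and standard deviation 1 (toy numbers here, not the project data):

[source,python]
----
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[10.0], [20.0], [30.0], [40.0]])
Xs = StandardScaler().fit_transform(X)

# Standardized values have mean 0 and (population) standard deviation 1
print(Xs.mean(), Xs.std())
----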

=== Question 4 (2 points)

The next step is to initialize and train the KNN classifier.

The parameter `k` in K-Nearest Neighbors (KNN) is the number of nearest neighbors. Changing `k` will impact the model's performance:

- k = 1: one neighbor, highly flexible (low bias) but can have high variance.
- k = 5: five neighbors, less sensitive to noise compared to k = 1.
- k = 10: ten neighbors, even less sensitive with lower variance.

https://blog.dataiku.com/bias-and-noise-in-machine-learning[This article] describes the concepts of bias, variance, and noise.

For small, clean datasets like `iris.csv`, you could potentially get perfect accuracy for different k values.
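One way to see the effect of `k` is to train classifiers with several values. The sketch below uses a synthetic dataset (not the project's `iris.csv`) so it runs on its own:

[source,python]
----
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data, just for illustration
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Compare test accuracy as k grows
for k in (1, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))
----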

[source,python]
----
from sklearn.neighbors import KNeighborsClassifier

# Initialize the KNN classifier with k = 3 and train it on the training set
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
----

=== Question 5 (2 points)

You can make a prediction using the `predict` method, as in the following code, which predicts the class of the first instance in the testing set:

[source,python]
----
predicted_class = knn.predict([X_test[0]])
----

.. What is the predicted class for the 20th instance in the testing set using the KNN classifier with k=3?

Project 02 Assignment Checklist
====
= 301 Project 03 - K-Nearest Neighbors II

== Project Objectives

In this project, we will continue to understand the foundational ideas of the K-Nearest Neighbors (KNN) algorithm using a regression approach.

== Reading and Resources


`/anvil/projects/tdm/data/boston.csv`


== Questions

[NOTE]
====
This project is largely based on concepts covered in the previous two projects, so you will be expected to write more of the code yourself. If you are struggling with these questions, feel free to refer back to the examples provided in Projects 1 and 2 for starting code to build on.
====

=== Question 1 (2 points)

This dataset contains various features of houses in Boston. The target variable is the median value of owner-occupied homes.

Let us load the data into a DataFrame:
[source,python]
----
import pandas as pd
my_df = pd.read_csv('/anvil/projects/tdm/data/boston.csv')
----

.. What are the mean, median, and standard deviation of the median price of owner-occupied homes (our target variable)?


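The three statistics can be computed on any pandas Series; here is a minimal sketch on made-up numbers (not `boston.csv`):

[source,python]
----
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
print(s.mean())    # mean
print(s.median())  # median
print(s.std())     # sample standard deviation (ddof=1)
----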
[TIP]
====
If you need a reminder on how to do this, take a glance at https://www.geeksforgeeks.org/create-the-mean-and-standard-deviation-of-the-data-of-a-pandas-series/[this article].
====

=== Question 2 (2 points)

Next we will need to pre-process the data.

.. Complete the following code to get the features and target

[source,python]
----
X = # your code here
y = # your code here
----

=== Question 3 (2 points)

Next we will split the dataset into training and testing sets.

.. Complete the following code to split the dataset using the appropriate function and parameters

[source,python]
----
X_train, X_test, y_train, y_test = # your code here
----

=== Question 4 (2 points)

Then let's standardize the features.

.. Please complete the following code
.. Please explain the motivation behind standardizing the features

[source,python]
----
scaler = StandardScaler()
X_train = # Your code here #
X_test = # Your code here #
----


=== Question 5 (2 points)

Now let's train a simple KNN regression model.

.. Please complete the following code

[source,python]
----
from sklearn.neighbors import KNeighborsRegressor
# Initialize the KNN regressor
knn = KNeighborsRegressor(n_neighbors=3)
# Train the model on the training set
# Your code here
----


Project 03 Assignment Checklist
To understand the core components of a simple linear regression model.

- `/anvil/projects/tdm/data/youtube/USvideos.csv` (referred to as `USvideos.csv` in some places in the file)



=== Question 1 (2 points)


Linear regression takes input (independent) variables and attempts to predict an output (dependent) variable.

For example, taking the number of views (independent variable) and trying to predict the number of likes (dependent variable).

First, let's load the dataset:

[source,python]
----
import pandas as pd
my_df = pd.read_csv('/anvil/projects/tdm/data/youtube/USvideos.csv')
print(my_df.head())
----


.. What is the size of the dataset 'USvideos.csv' (how many rows and columns)?
.. Based on the explanation, identify the independent and dependent variables in the dataset.
.. Based on the initial exploration, what are the mean, median, and standard deviation of the 'likes' column (the dependent variable)?

=== Question 2 (2 points)

Linear regression is a method that allows us to predict new values! If the model can learn enough about the patterns in the existing data, it can attempt to predict new values.

Note that the model assumes the data follow the same pattern, both now and in the future. If they don't, the model won't do very well. There are other modeling techniques that handle different data patterns.

Linear regression is also a core technique that many more advanced modeling types build on.

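A minimal sketch of this idea on toy data (y = 2x, not the YouTube dataset): the model learns the pattern and predicts a value it has never seen.

[source,python]
----
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

model = LinearRegression().fit(X, y)
# Predict for a new input, x = 5
print(model.predict([[5.0]]))
----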

We can clean the dataset by removing extreme values (outliers). Python has a library called `scipy` that provides utilities for this purpose.

[source,python]
----
from scipy import stats

# Compute z-scores, then keep only rows within 3 standard deviations
# (column name 'Value' assumed here, matching earlier examples)
my_df['z_scores'] = stats.zscore(my_df['Value'])
my_df = my_df[my_df['z_scores'].abs() < 3]
my_df = my_df.drop(columns=['z_scores'])
----


.. Please use Z-score to remove the outliers for the 'views' column.
.. Use your own words to explain the statement you used to accomplish the task.




=== Question 3 (2 points)


The following code is used to create a boxplot for the original data:

[source,python]
----
import matplotlib.pyplot as plt

# Boxplot of the original data ('views' column assumed from the axis label)
plt.boxplot(my_df['views'])
plt.ylabel('Views')
plt.show()
----


.. Please visualize the boxplot for the 'views' column after removing outliers. How do the plots differ?



=== Question 4 (2 points)

Scaling the data can ensure that features contribute equally to the model, improving model performance.

For example, scaling the 'views' column ensures that this variable does not dominate the model simply because its values are in the thousands compared to other variables.
[source,python]
----
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
my_df['views'] = scaler.fit_transform(my_df[['views']])
----


.. What is scaling, and why is it important in linear regression? Provide an example.



=== Question 5 (2 points)


We build a linear regression model by fitting a line to the data, which minimizes the sum of the squared differences between the observed and predicted values. The minimized value of this sum is called the Least Squares Error (LSE).

The following example program creates a linear regression model:
[source,python]
----
# ... (data preparation, model fitting, and prediction steps above) ...
lse = ((y_test - predictions) ** 2).sum()
print(f'Least Squares Error: {lse}')
----

.. What is the Least Squares Error (LSE) of your output?
.. Please use your own words to describe how LSE is used in linear regression.

