# 2-Week Roadmap

This 2-week roadmap builds on the topics from Joel Grus's "Data Science from Scratch" and adds further resources to deepen your understanding of the subject. It includes practical projects where you can apply what you learn.

**Week 1**

**Day 1: Data Exploration and Visualization**
- Learn the basics of pandas: [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
- Data Wrangling with pandas: [Data Wrangling with pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- Learn the basics of Matplotlib: [Matplotlib tutorial](https://matplotlib.org/stable/tutorials/introductory/pyplot.html)
- Introduction to seaborn: [seaborn tutorial](https://seaborn.pydata.org/tutorial.html)

**Project 1**: Explore and visualize the [Titanic dataset](https://www.kaggle.com/c/titanic/data) using pandas, Matplotlib, and seaborn.

**Day 2-3: Supervised Machine Learning and scikit-learn**
- Introduction to scikit-learn: [Getting started with scikit-learn](https://scikit-learn.org/stable/getting_started.html)
- Supervised learning algorithms: [Supervised learning tutorial](https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html)
- Model evaluation: [Model evaluation tutorial](https://scikit-learn.org/stable/modules/model_evaluation.html)

**Project 2**: Predict survival on the Titanic using various supervised learning algorithms from scikit-learn. Evaluate the performance of your models.

**Day 4: Unsupervised Machine Learning**
- Unsupervised learning algorithms: [Unsupervised learning tutorial](https://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html)
- Clustering with k-means: [k-means clustering](https://scikit-learn.org/stable/modules/clustering.html#k-means)

**Project 3**: Perform customer segmentation using the [Mall Customer Segmentation dataset](https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python) with k-means clustering.

**Day 5: Feature Engineering and Dimensionality Reduction**
- Feature engineering: [Feature engineering guide](https://elitedatascience.com/feature-engineering-best-practices)
- PCA for dimensionality reduction: [PCA tutorial](https://scikit-learn.org/stable/modules/decomposition.html#pca)

**Project 4**: Apply feature engineering and PCA to improve the performance of your supervised learning models on the Titanic dataset.

**Week 2**

**Day 6-7: Introduction to Neural Networks and TensorFlow**
- Neural Networks: [Neural Networks and Deep Learning book](http://neuralnetworksanddeeplearning.com/)
- TensorFlow basics: [Get started with TensorFlow](https://www.tensorflow.org/tutorials/quickstart/beginner)

**Project 5**: Classify handwritten digits using the [MNIST dataset](https://www.tensorflow.org/datasets/catalog/mnist) with TensorFlow.

**Day 8-9: Deep Learning with Keras**
- Keras introduction: [Keras guide](https://keras.io/guides/)
- Convolutional Neural Networks (CNNs): [CNNs for visual recognition](http://cs231n.github.io/convolutional-networks/)

**Project 6**: Classify images using the [CIFAR-10 dataset](https://www.tensorflow.org/datasets/catalog/cifar10) with a CNN implemented in Keras.

**Day 10-11: Natural Language Processing**
- Introduction to NLP: [NLP with Python book, Chapter 1](https://www.nltk.org/book/ch01.html)
- Text preprocessing: [Text preprocessing guide](https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html)
- Text classification with scikit-learn: [Working with text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)

**Project 7**: Sentiment analysis on the [IMDb movie reviews dataset](https://ai.stanford.edu/~amaas/data/sentiment/) using scikit-learn.

**Day 12: Word Embeddings and Word2Vec**
- Word embeddings: [Word embeddings tutorial](https://www.tensorflow.org/tutorials/text/word_embeddings)
- Word2Vec: [Word2Vec explained](https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa)

**Project 8**: Train a Word2Vec model on a large text corpus, such as [Gutenberg books](https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html), and explore the generated embeddings.

**Day 13: Recurrent Neural Networks (RNNs) and LSTMs**
- RNNs and LSTMs: [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- Implementing LSTMs with Keras: [A ten-minute introduction to sequence-to-sequence learning in Keras](https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html)

**Project 9**: Build an RNN or LSTM model to generate text based on a large text corpus.

**Day 14: Review and Next Steps**
- Review the topics and projects from the past two weeks.
- Identify areas where you need more practice or further study.
- Plan your next steps in learning data science and AI, such as exploring reinforcement learning, unsupervised deep learning, or advanced NLP techniques like transformers.

Throughout this roadmap, remember to consult the documentation for the various Python packages you'll be using, as well as online forums and communities such as Stack Overflow and the relevant subreddits on Reddit.

As you work on these projects, feel free to ask me questions or seek guidance. Good luck on your learning journey!

# Matplotlib

[Quick start guide](https://matplotlib.org/stable/tutorials/introductory/quick_start.html)

In [None]:
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

Matplotlib graphs your data on Figures (e.g., windows, Jupyter widgets, etc.), each of which can contain one or more Axes. The simplest way of creating a Figure with an Axes is using pyplot.subplots. We can then draw some data on the Axes:

In [None]:
fig, ax = plt.subplots()  # Create a figure containing a single axes.
ax.plot([1, 2, 3, 4], [1, 4, 2, 3])  # first list is x, second is y
plt.show()

In [None]:
# imo this is the simplest way to plot
plt.plot([1, 2, 3, 4], [1, 4, 2, 3])  # first list is x, second is y
plt.show()

see [cheat sheet](matplotlib_cheat_sheet.jpg)

## Figure

The Figure keeps track of all the child Axes, a smattering of ‘special’ artists (titles, figure legends, etc), and the canvas. (Don’t worry too much about the canvas, it is crucial as it is the object that actually does the drawing to get you your plot, but as the user it is more-or-less invisible to you). 

A Figure can contain any number of Axes, but will typically have at least one.

In [None]:
fig = plt.figure()  # an empty figure with no Axes

In [None]:
fig, ax = plt.subplots()  # a figure with a single Axes

In [None]:
fig, axs = plt.subplots(2, 2)  # a figure with a 2x2 grid of Axes # note the s in subplots

### Axes
An Axes is an Artist attached to a Figure that contains a region for plotting data, and usually includes two (or three in the case of 3D) *Axis* objects (be aware of the difference between Axes and Axis) that provide ticks and tick labels to provide scales for the data in the Axes. Each Axes also has a title (set via `set_title()`), an x-label (set via `set_xlabel()`), and a y-label (set via `set_ylabel()`).

*It is basically a graph within the graph.*

### Axis
These objects set the scale and limits and generate ticks (the marks on the Axis) and ticklabels (strings labeling the ticks). The location of the ticks is determined by a Locator object and the ticklabel strings are formatted by a Formatter. The combination of the correct Locator and Formatter gives very fine control over the tick locations and labels.

### Artist
Basically, everything visible on the Figure is an Artist (even Figure, Axes, and Axis objects). This includes Text objects, Line2D objects, collections objects, Patch objects, etc. When the Figure is rendered, all of the Artists are drawn to the canvas. Most Artists are tied to an Axes; such an Artist cannot be shared by multiple Axes, or moved from one to another.
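
To make the Figure / Axes / Axis / Artist vocabulary concrete, here is a small sketch that builds a 2x2 grid and inspects the objects. The titles and data are made up for illustration; the calls themselves are standard Matplotlib.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.axis import Axis

fig, axs = plt.subplots(2, 2)        # one Figure holding a 2x2 grid of Axes
ax = axs[0, 0]                       # pick the top-left Axes
ax.plot([1, 2, 3], [1, 4, 9])        # draws a Line2D artist on that Axes
ax.set_title("top left")
ax.set_xlabel("x")
ax.set_ylabel("y")

print(len(fig.axes))                 # number of Axes the Figure keeps track of
print(isinstance(ax.xaxis, Axis))    # each Axes owns an x Axis and a y Axis
print(isinstance(fig, Artist), isinstance(ax, Artist), isinstance(ax.xaxis, Artist))
```

Note the plural: `axs` is a NumPy array of Axes, while `ax.xaxis` and `ax.yaxis` are the individual Axis objects that handle ticks and limits.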

## Types of inputs to plotting functions

In [None]:
# best to convert to numpy arrays with np.asarray
b = np.matrix([[1, 2], [3, 4]])
b_asarray = np.asarray(b)
b_asarray

```python
fig, ax = plt.subplots(figsize=(5, 2.7), layout='constrained')

# is roughly equivalent to

fig = plt.figure(figsize=(5, 2.7), layout='constrained')  # an empty Figure, no Axes yet
ax = fig.add_subplot()                                     # add a single Axes to it
```

## Labelling plots

This example also shows what `ax.hist()` returns: `n` (the height of each bin), `bins` (the bin edges), and `patches` (the rectangles that were drawn).

In [None]:
mu, sigma = 115, 15
x = mu + sigma * np.random.randn(10000)
fig, ax = plt.subplots(figsize=(8, 4))

# the histogram of the data
n, bins, patches = ax.hist(x, 50, density=True, facecolor='C0', alpha=0.75)

ax.set_xlabel('Length [cm]')
ax.set_ylabel('Probability')
ax.set_title('Aardvark lengths\n (not really)')
ax.text(75, .025, r'$\mu=115,\ \sigma=15$')
ax.axis([55, 175, 0, 0.03])
ax.grid(True)
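
To interpret `n, bins, patches`, a quick check of their sizes helps. This is a sketch with a seeded random generator (my assumption, only so the run is repeatable); the figure is created but not shown.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)                 # seeded generator, assumed for repeatability
x = 115 + 15 * rng.standard_normal(10_000)

fig, ax = plt.subplots()
n, bins, patches = ax.hist(x, 50, density=True)

print(len(n), len(bins), len(patches))         # 50 heights, 51 edges, 50 rectangles
print(np.sum(n * np.diff(bins)))               # with density=True the bar areas sum to about 1
```

So `bins` always has one more entry than `n`, because it stores the edges rather than the centres of the bins.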


# Pyplot tutorial
## Introduction to pyplot

`pyplot` is a collection of functions.  Each `pyplot` function makes some change to a figure:
e.g., creates a figure, creates a plotting area in a figure, plots some lines
in a plotting area, decorates the plot with labels, etc.

In `matplotlib.pyplot` various states are preserved
across function calls, so that it keeps track of things like
the current figure and plotting area, and the plotting
functions are directed to the current axes (please note that "axes" here
and in most places in the documentation refers to the *axes*
part of a figure
and not the strict mathematical term for more than one axis).

Generating visualizations with pyplot is very quick:

In [None]:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()

You may be wondering why the x-axis ranges from 0-3 and the y-axis
from 1-4.  **If you provide a single list or array to
`plt.plot`, matplotlib assumes it is a
sequence of y values, and automatically generates the x values for
you.**  Since python ranges start with 0, the default x vector has the
same length as y but starts with 0; therefore, the x data are
`[0, 1, 2, 3]`.

`plt.plot` is a versatile function, and will take an arbitrary number of
arguments.  For example, to plot x versus y, you can write:
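
The generated x values can be read back from the line object. A minimal sketch using the Axes interface (the figure is created but not shown):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([1, 2, 3, 4])      # only y values are given
print(line.get_xdata())              # the x values Matplotlib generated
print(line.get_ydata())              # the y values that were passed in
```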



In [None]:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])

### Formatting the style of your plot

For every x, y pair of arguments, there is an optional third argument
which is the format string that indicates the color and line type of
the plot.  The letters and symbols of the format string are from
MATLAB, and you concatenate a color string with a line style string.
*The default format string is 'b-', which is a solid blue line*.  For
example, to plot the above with red circles, you would issue:

In [None]:
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro') # note the 'r' for red and 'o' for circles.
plt.axis([0, 6, 0, 20])                     # The `.axis` function takes a list of `[xmin, xmax, ymin, ymax]`
plt.show()

`plot` accepts repeated groups of `x, y, [format string]`, so several series can be drawn in a single call:

In [None]:
import numpy as np

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^') # red dashes, blue squares and green triangles
plt.show()


## Plotting with keyword strings

When your data is in a format that lets you access variables by name, such as a pandas `DataFrame` or a dictionary of arrays, Matplotlib allows you to pass that object with the `data` keyword argument. You can then generate plots using
the strings that name these variables.

In [None]:
data = {'a': np.arange(50),
        'c': np.random.randint(0, 50, 50),
        'd': np.random.randn(50)}

data['b'] = data['a'] + 10 * np.random.randn(50)
data['d'] = np.abs(data['d']) * 100

plt.scatter('a', 'b', c='c', s='d', data=data) # so just notice the data=data
plt.xlabel('entry a')
plt.ylabel('entry b')
plt.show()

## Plotting with categorical variables

It is also possible to create a plot using categorical variables.
Matplotlib allows you to pass categorical variables directly to
many plotting functions. For example:

In [None]:
names = ['group_a', 'group_b', 'group_c']
values = [1, 10, 100]

plt.figure(figsize=(9, 3))

plt.subplot(131)    # 1 row, 3 columns, 1st plot, equivalent to plt.subplot(1, 3, 1)
plt.bar(names, values)
plt.subplot(132)
plt.scatter(names, values)
plt.subplot(133)
plt.plot(names, values)
plt.suptitle('Categorical Plotting')
plt.show()


## Controlling line properties

Lines have many attributes that you can set: linewidth, dash style,
antialiased, etc.  There are several ways to set line properties:

* Use keyword arguments: `plt.plot(x, y, linewidth=2.0)`

* Use the setter methods:

```python
line, = plt.plot(x, y, '-')
line.set_antialiased(False)  # turn off antialiasing
```

* Use `plt.setp`:

```python
lines = plt.plot(x1, y1, x2, y2)
# use keyword arguments
plt.setp(lines, color='r', linewidth=2.0)
```
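
The three approaches can be tried together in one runnable sketch. The sine and cosine data are placeholders I made up; the property values are read back with the matching getter methods.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)    # sample data, assumed for the example
fig, ax = plt.subplots()

# 1. keyword argument at creation time
(line,) = ax.plot(x, np.sin(x), linewidth=2.0)

# 2. setter methods on the returned Line2D
line.set_antialiased(False)
line.set_linestyle("--")

# 3. plt.setp on a list of lines
lines = ax.plot(x, np.cos(x), x, -np.cos(x))
plt.setp(lines, color="r", linewidth=3.0)

print(line.get_linewidth(), line.get_linestyle(), line.get_antialiased())
print([ln.get_color() for ln in lines])
```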

## Working with multiple figures and axes

MATLAB, and `pyplot`, have the concept of the current figure
and the current axes.  All plotting functions apply to the current
axes.  

The function `plt.gca` (Get the Current Axes) returns the current axes, and `plt.gcf` (Get the Current Figure) returns the current
figure.



In [None]:
def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure()
plt.subplot(211)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(212)
plt.plot(t2, np.cos(2*np.pi*t2), 'r--')
plt.show()

You can clear the current figure with `plt.clf()`
and the current axes with `plt.cla()`. `plt.close()` closes the current figure entirely, and `plt.close('all')` closes every open figure.

## Working with text

`plt.text` can be used to add text in an arbitrary location, and
`plt.xlabel`, `plt.ylabel` and `plt.title` are used to add
text in the indicated locations.



In [None]:
mu, sigma = 100, 15
x = mu + sigma * np.random.randn(10000)

# the histogram of the data
n, bins, patches = plt.hist(x, 50,
                            density=True,                # normalise so the bar areas sum to 1
                            # cumulative=True,           # uncomment to plot the cumulative distribution instead
                            # facecolor='g', alpha=0.75
                            )


plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
plt.show()

In [None]:
bins

### Using mathematical expressions in text

Matplotlib accepts TeX equation expressions in any text expression.
For example, to write the expression $\sigma_i=15$ in the title,
you can write a TeX expression surrounded by dollar signs:
```python
plt.title(r'$\sigma_i=15$')
```
The `r` preceding the title string is important: it signifies
that the string is a *raw* string, so backslashes are not treated as
Python escapes.  

### Annotating text

The uses of the basic `plt.text` function above
place text at an arbitrary position on the Axes.  A common use for
text is to annotate some feature of the plot, and the
`plt.annotate` function provides helper
functionality to make annotations easy.  In an annotation, there are
two points to consider: the location being annotated, represented by
the argument `xy`, and the location of the text, `xytext`.  Both of
these arguments are `(x, y)` tuples.
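
A minimal sketch of an annotation, using a cosine curve as stand-in data. The label text and coordinates are chosen by hand for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0.0, 5.0, 0.01)
s = np.cos(2 * np.pi * t)

fig, ax = plt.subplots()
ax.plot(t, s, lw=2)
ann = ax.annotate(
    "local max",
    xy=(2, 1),                # the point being annotated
    xytext=(3, 1.5),          # where the text is placed
    arrowprops=dict(facecolor="black", shrink=0.05),
)
ax.set_ylim(-2, 2)            # leave room above the curve for the text
```

Both `xy` and `xytext` are given in data coordinates here, which is the default.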



## Logarithmic and other nonlinear axes

`matplotlib.pyplot` supports not only linear axis scales, but also logarithmic and logit scales. Changing the scale of an axis is easy:

    plt.xscale('log')

An example of two plots with the same data and different scales for the y-axis
(linear and log) is shown below.



In [None]:
# Fixing random state for reproducibility
np.random.seed(19680801)

# make up some data in the open interval (0, 1)
y = np.random.normal(loc=0.5, scale=0.4, size=1000)
y = y[(y > 0) & (y < 1)]
y.sort()
x = np.arange(len(y))

# plot with various axes scales
plt.figure()

# linear
plt.subplot(221)
plt.plot(x, y)
plt.yscale('linear')
plt.title('linear')
plt.grid(True)

# log
plt.subplot(222)
plt.plot(x, y)
plt.yscale('log') # note the 'log' here
plt.title('log')
plt.grid(True)

plt.show()

It is also possible to add your own scale, see `matplotlib.scale` for
details.



# Seaborn tutorial

see [here](https://seaborn.pydata.org/tutorial.html)

In [None]:
# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()

# Load an example dataset
tips = sns.load_dataset("tips")

# Create a visualization
sns.relplot(
    data=tips,
    x="total_bill", 
    y="tip", 
    col="time",
    hue="smoker", 
    style="smoker", 
    size="size",
)

Notice how we provided only the names of the variables and their roles in the plot. Unlike when using matplotlib directly, it wasn’t necessary to specify attributes of the plot elements in terms of the color values or marker codes. Behind the scenes, seaborn handled the translation from values in the dataframe to arguments that matplotlib understands. This declarative approach lets you stay focused on the questions that you want to answer, rather than on the details of how to control matplotlib.

The function `relplot()` is named that way because it is designed to visualize many different **statistical relationships**. While scatter plots are often effective, relationships where one variable represents a measure of time are better represented by a line. The relplot() function has a convenient kind parameter that lets you easily switch to this alternate representation:

In [None]:
dots = sns.load_dataset("dots")
sns.relplot(
    data=dots, 
    # kind="line",    # uncomment to draw lines instead of the default scatter plot
    x="time", 
    y="firing_rate", 
    col="align",
    hue="choice", 
    size="coherence", 
    style="choice",
    facet_kws=dict(sharex=False),
)

Notice how the size and style parameters are used in both the scatter and line plots, but they affect the two visualizations differently: changing the marker area and symbol in the scatter plot vs the line width and dashing in the line plot. We did not need to keep those details in mind, letting us focus on the overall structure of the plot and the information we want it to convey.

In [None]:
# Many seaborn functions will automatically perform the statistical estimation that is necessary to answer these questions:

fmri = sns.load_dataset("fmri")
sns.relplot(
    data=fmri, 
    kind="line",
    x="timepoint", 
    y="signal", 
    col="region",
    hue="event", 
    style="event",
)

When statistical values are estimated, seaborn will use bootstrapping to compute confidence intervals and draw error bars representing the uncertainty of the estimate.
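
To get a feel for what bootstrapping does, here is a from-scratch sketch of a percentile bootstrap for the mean using NumPy only. It is not seaborn's actual implementation; the data, the number of resamples, and the 95% level are my own assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=10.0, scale=2.0, size=200)   # made-up data

# Resample with replacement many times and record the mean of each resample
n_boot = 2000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# Percentile interval: the middle 95% of the bootstrap means
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.2f}, 95% CI=({lower:.2f}, {upper:.2f})")
```

The width of the interval is what the shaded band or error bar in the plot represents.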

Statistical estimation in seaborn goes beyond descriptive statistics. For example, it is possible to enhance a scatterplot by including a linear regression model (and its uncertainty) using `lmplot()`:

### lmplot()
for linear regression

In [None]:
sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker")

### displot()

and histplot()

In [None]:
# for distributional representations / histograms etc. seaborn uses displot
sns.displot(data=tips, x="total_bill", col="time", kde=True)

These plots show that `total_bill` is not normally distributed.

### jointplot()

In [None]:
penguins = sns.load_dataset("penguins")
sns.jointplot(data=penguins, x="flipper_length_mm", y="bill_length_mm", hue="species")

### pairplot()

In [None]:
sns.pairplot(data=penguins, hue="species")

# Day 2-3

https://scikit-learn.org/stable/tutorial/basic/tutorial.html#introduction

https://scikit-learn.org/stable/tutorial/statistical_inference/supervised_learning.html

https://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html

https://scikit-learn.org/stable/modules/model_evaluation.html


## Some essential questions I asked and got answered:

#### 1. **How to plot the distribution of random variables**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set the seed for reproducibility
np.random.seed(123)

# Generate 1000 random variables from a normal distribution
random_variables = np.random.normal(0, 1, 1000)

# Plot the distribution of the random variables
sns.histplot(random_variables, 
             kde=True             # overlay a smooth kernel density estimate curve
             )

plt.xlabel('Value')
plt.ylabel('Frequency')
plt.title('Distribution of Random Variables')
plt.show()

In [None]:
plt.rcParams['figure.dpi'] = 200

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np

# Create a histogram of the 'tip' column
sns.histplot(tips['tip'], stat='density', binwidth=0.1,  color='purple')

In [None]:
# x values spanning the range of the 'tip' data (np.linspace returns 50 points by default)
x_norm = np.linspace(tips['tip'].min(), tips['tip'].max())

# Fit a normal distribution to the 'tip' data
mu, sigma = stats.norm.fit(tips['tip'])
# Evaluate the fitted normal density at each x value
y_norm = stats.norm.pdf(x_norm, mu, sigma)

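`x_norm` is a NumPy array of evenly spaced numbers over the range of the tip data, created with `numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)`. Since `num` is left at its default, it holds 50 points.

`y_norm` holds the normal probability density evaluated at each point of `x_norm`, computed with `pdf(x, loc=0, scale=1)`, where `loc` is the mean and `scale` is the standard deviation: 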

In [None]:
# Plot the normal distribution overlay
plt.plot(x_norm, y_norm, color="green", label=f"Normal dist. (μ={mu:.2f}, σ={sigma:.2f})")
plt.title("Tip Distribution")
plt.xlabel("Tip")
plt.ylabel("Density")
plt.legend()
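
What the fit and the density do can be reproduced by hand. For a normal distribution, the maximum-likelihood fit is simply the sample mean and the standard deviation computed with `ddof=0`, which is what `stats.norm.fit` returns when no parameters are fixed. The sketch below uses NumPy only and made-up data standing in for `tips['tip']`; `normal_pdf` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.4, size=500)   # stand-in for tips['tip'] (made-up values)

# Maximum-likelihood estimates for a normal distribution
mu = data.mean()
sigma = data.std()            # ddof=0 by default

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x_norm = np.linspace(mu - 6 * sigma, mu + 6 * sigma, 2001)
y_norm = normal_pdf(x_norm, mu, sigma)

dx = x_norm[1] - x_norm[0]
print(np.sum(y_norm) * dx)    # Riemann sum of the density, close to 1
```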

#### 2. **How to plot dependent variables and show how they correlate with each other**

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
import pandas as pd

# Load the iris dataset
iris = load_iris()
data = iris.data
feature_names = iris.feature_names
target = iris.target

# Convert to DataFrame for easier manipulation
df = pd.DataFrame(data, columns=feature_names)
df['target'] = target

# Create a pairplot
sns.pairplot(df, hue='target')

In this script, the iris dataset is loaded and converted into a DataFrame. Then, `sns.pairplot(df, hue='target')` is used to create a pairplot. The `hue='target'` option colors the points according to their class, which makes it easier to see how the features correlate with the target variable.

Each plot on the diagonal shows the distribution of a single feature, and the plots off the diagonal show the relationships between pairs of features. By looking at these plots, you can see how the features correlate with each other and with the target variable.

### The diagonal uses KDE, Kernel Density Estimation.

It's a technique used to smooth a histogram and create an estimate of the probability density function (PDF) of a random variable.

A histogram can be noisy or misleading depending on how the bins (the ranges of values) are chosen, and they don't give a smooth, continuous estimate of the underlying distribution. 

KDE addresses these issues by placing a continuous "kernel" function on each data point. The kernels are then summed to create a smooth estimate of the distribution. The kernel functions are usually Gaussian, but other shapes can be used as well.

In data visualization, KDE is often used to create smoothed versions of histograms or to estimate the distribution of data in scatter plots. It's a useful technique for understanding the shape of the data distribution, especially when the number of data points is large.
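
The "one kernel per data point" idea can be sketched in a few lines of NumPy. This is an illustration under my own assumptions (Gaussian kernel, a bandwidth picked by hand, made-up data), not the estimator seaborn uses internally.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=200)          # made-up sample
bandwidth = 0.4                      # kernel width, chosen by hand for this sketch

grid = np.linspace(-6, 6, 1201)      # points where the density is evaluated

# One Gaussian bump per data point: rows are grid positions, columns are data points
z = (grid[:, None] - data[None, :]) / bandwidth
kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))

# Summing the bumps and dividing by the number of points (i.e. averaging)
# gives a density estimate that integrates to about 1
density = kernels.mean(axis=1)

dx = grid[1] - grid[0]
print(density.shape, np.sum(density) * dx)
```

A smaller bandwidth gives a wigglier curve that follows individual points; a larger one gives a smoother curve that can hide structure.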

In [None]:
sns.pairplot(tips, diag_kind='kde')

#### **3. numpy arrays**

`X` is the samples matrix (or design matrix). The size of X is typically (n_samples, n_features), which means that samples are represented as rows and features are represented as columns.

`y` is the target variable. 

Both X and y are usually expected to be *numpy arrays*

In [None]:
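X = np.array([[1, 2, 3]])  # example array with shape (1, 3), assumed here for illustration
X.ndim  # number of dimensions; X.shape gives the size of each dimension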

The shape of a NumPy array tells you **how many dimensions** the array has and **how many elements are in each dimension**. For example, the shape (1, 3) means that the array is 2-dimensional, with 1 element along the first dimension (a single row) and 3 elements along the second dimension.

In [None]:
X[0, 0]  # one index per dimension: row 0, column 0

Multi-dimensional arrays should be seen as matrices: 

A 2-dimensional array, like:

```python
    b = np.array([[1, 2, 3], [4, 5, 6]])
```

has 2 samples and 3 features. 

**Each sample is a row, each feature is a column.**

| label | Col 1 | Col 2 | Col 3 |
| :--: | :--: | :--: | :--: |
| measurement 1 | 1 | 2 | 3 |
| measurement 2 | 4 | 5 | 6 |


In [None]:
b = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
b

A 3-dimensional array, like:

In [None]:
# 3-dimensional array
c = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
c

In [None]:
# The book develops all this vector arithmetic from scratch: add, subtract, scalar_multiply (multiply by a constant),
# then vector_mean, dot product, sum of squares, magnitude (square root of sum of squares), and distance*
c_sum = c + 5  # NumPy applies the scalar addition element-wise
c_sum

A 3-dimensional array like `c` has matrices as its sublists, which in turn have rows as their sublists. You can picture these matrices as stacked one behind the other.

In [None]:
c[0, 1, 1]  # selects the value 5: matrix 0, row 1, column 1

In [None]:
# 4-dimensional array
d = np.array([
    [
        [[1, 2, 3], [4, 5, 6]], 
        [[7, 8, 9], [10, 11, 12]]
    ], 
    [
        [[13, 14, 15], [16, 17, 18]], 
        [[19, 20, 21], [22, 23, 24]]
    ]
])
d

In this example, d is a 4-dimensional array containing two 3-dimensional arrays. Each of these 3-dimensional arrays contains two 2-dimensional arrays, and each of these 2-dimensional arrays contains two 1-dimensional arrays.

You can access elements of a 4-dimensional array using four indices. The first index selects which 3-dimensional array to access, the second index selects which 2-dimensional array within that 3-dimensional array to access, the third index selects which 1-dimensional array within that 2-dimensional array to access, and the fourth index selects which element within that 1-dimensional array to access.

For example, d[1, 0, 1, 2] would give the value in the second 3-dimensional array, first 2-dimensional array, second 1-dimensional array, and third element, which is 18 in this case.
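
The indexing described above can be checked directly. In this sketch `d` is rebuilt with `np.arange(...).reshape(...)`, which yields the same values as the nested-list version.

```python
import numpy as np

d = np.arange(1, 25).reshape(2, 2, 2, 3)   # same values as the nested-list version of d

print(d.ndim)            # 4
print(d.shape)           # (2, 2, 2, 3)
print(d[1, 0, 1, 2])     # 18
print(d[1, 0, 1])        # [16 17 18]
print(d[0, 1].shape)     # (2, 3)
```

Giving fewer indices than dimensions returns the remaining sub-array rather than a single element.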

In [None]:
# note that you can get any sublist by index 
d[0,1] 

In [None]:
# *One of the kinds of vector arithmetic the book defines is distance; here is how NumPy does it.
# Note how it uses the linear algebra module np.linalg, specifically its norm function.

# Define two 2-dimensional vectors (arrays)
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([[7, 8, 9], [10, 11, 12]])

# Calculate the Euclidean distance between a and b
distance = np.linalg.norm(a - b)

print(distance)
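
For comparison, here is a from-scratch version in plain Python, in the spirit of the book's list-based helpers. The function names and the example vectors are my own illustration and are not copied from the book.

```python
import math
import numpy as np

def sum_of_squares(v):
    """Sum of the squared elements of a list of numbers."""
    return sum(v_i * v_i for v_i in v)

def distance(v, w):
    """Euclidean distance between two equal-length lists of numbers."""
    return math.sqrt(sum_of_squares([v_i - w_i for v_i, w_i in zip(v, w)]))

v = [1, 2, 3]
w = [4, 6, 3]
print(distance(v, w))                                # 5.0
print(np.linalg.norm(np.array(v) - np.array(w)))     # 5.0
```

Both lines compute the square root of the sum of squared differences; NumPy just does it in one vectorised call.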

In [None]:
a.ndim

In [None]:
import numpy as np

# Set the seed for reproducibility
np.random.seed(0)

# 10 random numbers in one dimension
one_d = np.random.rand(10)
print("One dimension:\n", one_d)

# 10 random numbers in two dimensions (5 rows, 2 columns)
two_d = np.random.rand(5, 2)
print("\nTwo dimensions:\n", two_d)

# 10 random numbers in three dimensions (2 layers, 5 rows, 2 columns)
three_d = np.random.rand(2, 5, 2)
print("\nThree dimensions:\n", three_d)


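Note the **layers**: the first dimension of the 3-dimensional array.

Also note that `print` shows an array without commas between the elements. However, when you evaluate the array directly as the last expression in a cell, the output includes commas: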

In [None]:
three_d

Note that pandas dataframe columns translate to dimensions. 

> In linear regression, I am to find the coefficients (betas) for each dimension. These are not vector representations though. That's where I'm confused. 


In the context of linear regression, each column in your DataFrame represents a different dimension, or **feature**, of your data. Each row in the DataFrame represents a different observation or data point.

For example, let's say you have a DataFrame with three columns: 'Height', 'Weight', and 'Age'. Each of these columns represents a different dimension of your data. If you were to perform a linear regression with 'Height' as your dependent variable and 'Weight' and 'Age' as your independent variables, you would be fitting a plane in three-dimensional space (Height, Weight, Age) that best fits your data.

The coefficients you get from the linear regression represent the relationship between each independent variable and the dependent variable. For example, the coefficient for 'Weight' represents how much 'Height' is expected to change for a one-unit increase in 'Weight', assuming 'Age' is held constant.

In this sense, the **coefficients are not vector representations** of your data. Instead, they are **scalar values** that describe the relationship between your variables. The vector representation of a data point would be the list of all its feature values. 

For example, a person might be represented by the vector [Height, Weight, Age].
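
A small sketch may help separate the two ideas. It fits Height from Weight and Age by ordinary least squares with `np.linalg.lstsq`; the data and the "true" coefficients (100, 0.8, -0.1) are invented for the example and the data contain no noise, so the fit recovers them.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
weight = rng.uniform(50, 100, size=n)     # made-up feature values
age = rng.uniform(18, 70, size=n)

# Made-up ground truth: height = 100 + 0.8 * weight - 0.1 * age (no noise)
height = 100 + 0.8 * weight - 0.1 * age

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), weight, age])
coef, *_ = np.linalg.lstsq(X, height, rcond=None)

intercept, beta_weight, beta_age = coef
print(X[0])                                # one data point as a vector of feature values
print(intercept, beta_weight, beta_age)    # one scalar coefficient per column
```

Each row of `X` is the vector representation of one observation, while each coefficient is a single number attached to one column.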

#### **4. f-strings with list comprehensions and decimal formatting**

Something else I've been quite impressed by is how you can write an f-string inside a list comprehension and pass the result directly as an argument:

```python
    df = pd.DataFrame(X, columns=[f'feature_{i+1}' for i in range(X.shape[1])])
```
this outputs columns:
> feature_1 feature_2 etc.

Another cool thing is writing the variable like this, `{variable:.2f}`, to limit the number of decimals shown.
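
Both tricks in one short, runnable sketch. `X` here is just a placeholder array of zeros and `value` is an arbitrary number.

```python
import numpy as np

X = np.zeros((4, 3))                                   # placeholder data: 4 samples, 3 features
columns = [f"feature_{i+1}" for i in range(X.shape[1])]
print(columns)                                         # ['feature_1', 'feature_2', 'feature_3']

value = 3.14159
print(f"{value:.2f}")                                  # 3.14
print(f"{value:8.3f}|")                                # padded to width 8, with 3 decimals
```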

# **Day 5: Feature Engineering and Dimensionality Reduction**

- Feature engineering: [Feature engineering guide](https://elitedatascience.com/feature-engineering-best-practices)



## Feature engineering:

**Indicator Variables**

The first type of feature engineering involves using indicator variables to isolate key information.


- eg. from thresholds: Let’s say you’re studying alcohol preferences by U.S. consumers and your dataset has an age feature. You can create an indicator variable for age >= 21 to distinguish subjects who were over the legal drinking age.

Indicator variables can be created from thresholds, from combinations of multiple features, from special events, or for groups of classes.
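
A threshold indicator like the age example could look like this in pandas. The ages and the column name `is_21_or_older` are made up for the sketch.

```python
import pandas as pd

df = pd.DataFrame({"age": [17, 21, 35, 20, 64]})      # made-up ages

# Indicator from a threshold: 1 if the subject is 21 or older, else 0
df["is_21_or_older"] = (df["age"] >= 21).astype(int)
print(df)
```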

**Interaction Features**

Some features can be combined to provide more information than they would individually. 

- Sum of two features: Let’s say you wish to predict revenue based on preliminary sales data. You have the features sales_blue_pens and sales_black_pens. You could sum those features if you only care about overall sales_pens.

Look for opportunities to take the sum, difference, product, or quotient of multiple features. Think of built date and sold date to create an age feature, or price and customers to create a revenue stream feature.
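
As a sketch, a sum and a difference of two features in pandas. The numbers, and the `year_built` / `year_sold` / `age_at_sale` column names, are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "sales_blue_pens": [10, 4, 7],
    "sales_black_pens": [3, 9, 7],
    "year_built": [1990, 2005, 2015],
    "year_sold": [2010, 2010, 2020],
})

df["sales_pens"] = df["sales_blue_pens"] + df["sales_black_pens"]   # sum of two features
df["age_at_sale"] = df["year_sold"] - df["year_built"]              # difference of two features
print(df[["sales_pens", "age_at_sale"]])
```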

**External Data**

This can lead to some of the biggest breakthroughs in performance. Eg, one way quantitative hedge funds perform research is by layering together different streams of financial data.

- time series data: allows layering-in any other time-series data
- APIs
- Geocoding: see this free api https://geoservices.tamu.edu/Services/Geocode/WebService/Details/

**Error Analysis (Post-Modeling)**

Analyse misclassified or high error observations.

You can then try collecting more data, splitting the problem apart, or engineering new features that address the errors. To use error analysis for feature engineering, you’ll need to understand why your model missed its mark.

## Aside on iterators

I want to cycle through a list

In [None]:
z = iter([1,2,3])

In [None]:
next(z)

In [None]:
import itertools

a = itertools.cycle([1,2,3])

In [None]:
next(a)

In [None]:
b = iter(['change_prompt', 'change_prompt2', 'change_prompt3', 'change_prompt4']) # (input=q)

In [None]:
next(b)
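
The difference between the two approaches shows up once the list runs out: a plain iterator is exhausted after one pass, while `itertools.cycle` starts over. A small sketch with placeholder prompt names:

```python
import itertools

prompts = ["change_prompt", "change_prompt2", "change_prompt3"]

once = iter(prompts)
print([next(once) for _ in range(3)])      # one full pass over the list
print(next(once, "exhausted"))             # the default is returned instead of raising StopIteration

forever = itertools.cycle(prompts)
print([next(forever) for _ in range(5)])   # wraps around to the start after the third item
```

So for cycling through a list, `itertools.cycle` is the one to use; calling `next` on a plain iterator without a default raises `StopIteration` once it is exhausted.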