![image.png](attachment:image.png)

# TIMESERIES AND MACHINE LEARNING PRIMER
**This chapter is an introduction to the basics of machine learning, time series data, and the intersection between the two.**

# Timeseries Kinds & Applications

Welcome to Introduction to Machine Learning for Timeseries Data. This course focuses on the intersection of machine learning and timeseries data, so we expect that you have already taken introductory courses on machine learning and time series analysis here on DataCamp.

Put simply, a timeseries is data that changes over time. It can take many different forms, such as atmospheric CO2 over time or the waveform of my voice as I am speaking.
![image.png](attachment:image.png)

Other examples include the fluctuation of a stock's value over the year, or demographic information about a city.
![image-2.png](attachment:image-2.png)

4. What makes a time series?
Timeseries data consists of at least two things: One, an array of numbers that represents the data itself. Two, another array that contains a timestamp for each datapoint. The timestamps can include a wide range of time data, from months of the year to nanoseconds.
![image-3.png](attachment:image-3.png)

5. Reading in a time series with Pandas
Timeseries data can be read into a pandas DataFrame. Note that each datapoint has a corresponding time point (in this case, a date), though multiple datapoints may have the same time point.
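
As a minimal sketch of this step, the snippet below reads a small CSV into a DataFrame. The column names and values are made up for illustration, and the text is read from an in-memory string rather than a real file so that the example runs on its own.

```python
import io
import pandas as pd

# Made-up CSV contents standing in for a real file on disk (assumption)
csv_text = """date,symbol,close
2010-01-04,AAA,10.0
2010-01-04,BBB,20.0
2010-01-05,AAA,10.5
2010-01-05,BBB,19.5
"""

# pd.read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))
print(data.head())
```

Each date appears twice here, once per symbol, which is the "multiple datapoints may have the same time point" case.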

6. Plotting a pandas timeseries
To plot this timeseries data with Matplotlib and Pandas, we first create a figure and axis, then read in the data with Pandas and use the DataFrame's `.plot()` method to draw the data on that axis.
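
The plotting steps might look like the following sketch. The DataFrame is built inline from made-up values in place of data read from a file, and the column names `date` and `close` are assumptions.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up daily values standing in for data read from a file (assumption)
data = pd.DataFrame({
    "date": pd.date_range("2010-01-04", periods=5, freq="D"),
    "close": [10.0, 10.5, 10.2, 10.8, 11.0],
})

# Create the figure and axis first, then draw the DataFrame onto that axis
fig, ax = plt.subplots(figsize=(8, 4))
data.plot(x="date", y="close", ax=ax)
ax.set(xlabel="Date", ylabel="Closing value")
```

Passing `ax=ax` is what ties the pandas plot to the axis created beforehand, so further Matplotlib customization can be applied to the same object.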

7. A timeseries plot
The amount of time that passes between timestamps defines the "period" of the timeseries. In this case, it is about one day. This often helps us infer what kind of timeseries we're dealing with.
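
One way to check the period is to take the difference between consecutive timestamps. This is only a sketch with made-up dates; the median is used here as one reasonable summary of the typical step.

```python
import pandas as pd

# Made-up daily timestamps (assumption)
timestamps = pd.Series(pd.to_datetime(
    ["2010-01-04", "2010-01-05", "2010-01-06", "2010-01-07"]
))

# Time elapsed between each timestamp and the one before it
steps = timestamps.diff().dropna()

# The typical step is an estimate of the period
period = steps.median()
print(period)
```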

8. Why machine learning?
Machine learning has taken the world of data science by storm. In the last few decades, advances in computing power, algorithms, and community practices have made it possible to use computers to ask questions that were never thought possible. Machine learning is about finding patterns in data - often patterns that are not immediately obvious to the human eye. This is often because the data is either too large or too complex to be processed by a human.
![image-4.png](attachment:image-4.png)

9. Why machine learning?
Another crucial part of machine learning is that we can build a model of the world that formalizes our knowledge of the problem at hand. We can use this model to make predictions. Combined with automation, this can be a critical component of an organization's decision making.

10. Why combine these two?
Why should we treat timeseries any differently from another data set? Well, machine learning is all about finding patterns in data. Timeseries data always change over time, which turns out to be a useful pattern to utilize. For example, here is a raw waveform of someone speaking, and here is a collection of timeseries features that were extracted from it. As you can see, using timeseries-specific features lets us see a much richer representation of the raw data.

11. A machine learning pipeline
This course will focus on a simple machine learning pipeline in the context of timeseries data. This boils down to the following main steps. Feature extraction: what kinds of special features leverage a signal that changes over time? Model fitting: what kinds of models are suitable for asking questions with timeseries data? Validation: How can we validate a model that uses timeseries data? What considerations must we make because it changes in time?
- Feature Extraction
- Model Fitting
- Validation

# Machine Learning Basics

Now we'll cover the basics of Machine Learning. This should be a recap of material that you've already covered in previous DataCamp courses. We'll start with the basics of how to fit and predict a model using scikit-learn.

2. Always begin by looking at your data
Before performing any data analysis, you should always take a look at your raw data. This gives you a quick, high-level sense of the quality and kind of data you have. In NumPy, you can do so by printing out the first few rows of the array.

3. Always begin by looking at your data
In Pandas, this can be done by using the `.head()` method, which shows the first five rows and all columns by default.
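
Both ways of peeking at raw data can be sketched as follows. The array contents and column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up array: 10 samples (rows) by 3 features (columns)
array = np.arange(30).reshape(10, 3)

# NumPy: slice off the first few rows
print(array[:3])

# Pandas: .head() returns the first five rows by default
df = pd.DataFrame(array, columns=["a", "b", "c"])
print(df.head())
```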

4. Always visualize your data
It is also crucial to visualize your data. The proper visualization will depend on the kind of data you've got, though histograms and scatterplots are a good place to start. Look at the distribution of your data. Does it seem reasonable? Are there any outliers? Are you missing data? Each of these questions is important to answer before doing any analysis.
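
A rough sketch of these first checks is shown below, using randomly generated data with one value deliberately set to missing. The column names `x` and `y` are placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Randomly generated stand-in data, with one missing value added on purpose
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=200), "y": rng.normal(size=200)})
df.loc[5, "y"] = np.nan

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
df["x"].plot.hist(bins=20, ax=ax_hist)        # distribution of one column
df.plot.scatter(x="x", y="y", ax=ax_scatter)  # relationship between two columns

# Count missing values per column
n_missing = df.isna().sum()
print(n_missing)
```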

5. Scikit-learn
Once you've gotten to know your data, it's time to start modeling it. The most popular library for machine learning in Python is scikit-learn. It has a standardized API, so you can fit many different models with a similar code structure. In this lesson, we import a support vector machine to classify datapoints.

6. Preparing data for scikit-learn
scikit-learn expects data to have a particular shape. Before using scikit-learn, your data should be two-dimensional. The first axis should correspond to sample number, and the second should correspond to feature number. This pattern is used in almost all scikit-learn functions. If your data is not in this shape, there are a few options for reshaping it so that you can use it with scikit-learn.

7. If your data is not shaped properly
The most common approach is to "transpose" your data. This swaps the first and last axes, and is most useful when your data is two-dimensional.

8. If your data is not shaped properly
Another option is to use the `.reshape()` method, which lets you specify the shape you want.
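
The two reshaping options can be sketched with NumPy as follows. The arrays are made up; the point is only the resulting (samples, features) shape.

```python
import numpy as np

# Made-up array laid out as (features, samples): 2 features, 5 samples
array = np.arange(10).reshape(2, 5)

# Option 1: transpose so that samples come first
X = array.T
print(X.shape)  # (5, 2)

# Option 2: reshape a 1-D array of 5 values into 5 samples of 1 feature.
# -1 tells NumPy to infer that dimension from the array's length.
single_feature = np.arange(5)
X_single = single_feature.reshape(-1, 1)
print(X_single.shape)  # (5, 1)
```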

9. Fitting a model with scikit-learn
Now that your data has the correct shape, it's time to fit a model. First we must create an instance of the model we've imported (in this case, a support-vector classifier). You can then call the `.fit()` method on this instance to train the model, passing in X (the training data) and y (the label for each datapoint).
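
A minimal sketch of importing and fitting a model is shown below. `LinearSVC` is chosen here as one example of a support-vector classifier (the text does not say which class is used), and the two-cluster training data is randomly generated.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Randomly generated training data: two clusters of 20 samples, 2 features each
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2, scale=1, size=(20, 2)),
    rng.normal(loc=2, scale=1, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)  # one label per sample

# Create an instance of the model, then train it
model = LinearSVC()
model.fit(X, y)
```

Note that `X` has shape (samples, features) and `y` has one entry per sample, which is the layout scikit-learn expects.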

10. Investigating the model
It is often useful to investigate what kind of pattern the model has found. Most models will store this information in attributes that are created after calling `.fit()`. For a linear model, these include the coefficients assigned to each feature.

11. Predicting with a fit model
Once your model is fit, you can call the `.predict()` method on the model to determine labels for unseen datapoints.
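
Inspecting and predicting can be sketched together as follows. As an assumption, the model is a `LinearSVC` trained on randomly generated two-cluster data; the `coef_` attribute is available because this is a linear model.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Randomly generated training data: two well-separated clusters
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-2, scale=1, size=(20, 2)),
    rng.normal(loc=2, scale=1, size=(20, 2)),
])
y = np.array([0] * 20 + [1] * 20)

model = LinearSVC()
model.fit(X, y)

# Attributes ending in an underscore are created by .fit()
print(model.coef_)  # one coefficient per feature

# Predict labels for datapoints the model has not seen
X_new = np.array([[-3.0, -3.0], [3.0, 3.0]])
predictions = model.predict(X_new)
print(predictions)
```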

# Machine Learning & Time Series Data

In the final lesson of this chapter, we'll discuss the interaction between machine learning and timeseries data, and introduce why they're worth thinking about in tandem.

2. Getting to know our data
First, let's give a quick overview of the two datasets we'll be using. Both are freely available online, and come from the excellent website Kaggle.com.

3. The Heartbeat Acoustic Data
Audio is a very common kind of timeseries data. Audio tends to have a very high sampling frequency (often above 20,000 samples per second!). Our first dataset is audio data recorded from the hearts of medical patients. A subset of these patients have heart abnormalities. Can we use only this heartbeat data to detect which subjects have abnormalities?

4. Loading auditory data
Audio data is often stored in "wav" files. We can list all of these files using the "glob" function. It lists files that match a given pattern. Each of these files contains the auditory data for one heartbeat session, as well as the sampling rate for that data.
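
Listing files by pattern might look like the sketch below. The folder and file names are hypothetical; empty placeholder files are created in a temporary directory only so that the example runs on its own.

```python
import os
import tempfile
from glob import glob

# Hypothetical folder and file names, created here only for illustration
with tempfile.TemporaryDirectory() as data_dir:
    for name in ["session_a.wav", "session_b.wav", "notes.txt"]:
        open(os.path.join(data_dir, name), "w").close()

    # "*" matches any run of characters, so this picks up every .wav file
    audio_files = sorted(glob(os.path.join(data_dir, "*.wav")))
    print([os.path.basename(path) for path in audio_files])
```

`glob` does not guarantee any particular order, which is why the result is sorted here.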

5. Reading in auditory data
We'll use a library called "librosa" to read in the audio dataset. Librosa has functions for feature extraction, visualization, and analysis of auditory data. We can import the data using the "load" function, which returns the audio data and its sampling frequency; here they are stored in audio and sfreq. Note that the sampling frequency here is 2205, which means 2205 samples are recorded per second.

6. Inferring time from samples
Using only the sampling frequency, we can infer the timepoint of each datapoint in our audio file, relative to the start of the file.

7. Creating a time array (I)
Now we'll create an array of timestamps for our data. To do so, you have two options. The first is to generate a range of indices from zero up to the number of datapoints in your audio file, then divide each index by the sampling frequency. This gives you a timepoint for each datapoint.

8. Creating a time array (II)
The second option is to calculate the final timepoint of your audio data in a similar way, by dividing the index of the last sample by the sampling frequency. Then, use the linspace function to generate evenly-spaced numbers between 0 and the final timepoint. In either case, you should have an array of numbers of the same length as your audio data.
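
Both options can be sketched with NumPy as follows. The audio array is a placeholder of zeros rather than real heartbeat data, and the sampling frequency of 2205 is taken from the text above.

```python
import numpy as np

sfreq = 2205            # samples per second, as in the text
audio = np.zeros(4410)  # placeholder signal: 2 seconds of silence (assumption)

# Option 1: divide each sample index by the sampling frequency
indices = np.arange(len(audio))
time_a = indices / sfreq

# Option 2: compute the final timepoint, then space values evenly up to it.
# The last sample has index len(audio) - 1, hence the "- 1".
final_time = (len(audio) - 1) / sfreq
time_b = np.linspace(0, final_time, len(audio))
```

The two arrays agree up to floating-point error. If the final timepoint were computed as `len(audio) / sfreq` instead, the linspace spacing would be slightly too wide.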

9. The New York Stock Exchange dataset
Next, we'll explore data from the New York Stock Exchange. It runs over a much longer timespan than our audio data, and has a sampling frequency on the order of one sample per day (compared with 2,205 samples per second with the audio data). Our goal is to predict the stock value of a company using historical data from the market. As we are predicting a continuous output value, this is a regression problem.

10. Looking at the data
Let's take a look at the raw data. Each row is a sample for a given day and company. It seems that the dates go back all the way to 2010.

11. Timeseries with Pandas DataFrames
It is useful to investigate the "type" of data in each column. NumPy or Pandas may treat an array of data in special ways depending on its type. We can print the type of each column by looking at the `.dtypes` attribute. Here we see that the type of each column is "object", which is a generic data type.

12. Converting a column to a time series
Since we know one column is actually a list of dates, let's change the column type to "datetime" using the to_datetime function. This will help us perform visualization and analysis later on.
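
The conversion can be sketched as below, using a tiny made-up DataFrame whose `date` column starts out as plain text. The column names and values are assumptions.

```python
import pandas as pd

# Made-up data: the dates are stored as text, not as a datetime type
df = pd.DataFrame({
    "date": ["2010-01-04", "2010-01-05"],
    "close": [10.0, 10.5],
})
print(df.dtypes)

# Parse the text into timestamps and overwrite the column
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)
```

Once the column has a datetime type, pandas can sort, resample, and plot it as time rather than as arbitrary strings.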

# TIMESERIES AS INPUTS TO A MODEL
**The easiest way to incorporate time series into your machine learning pipeline is to use them as features in a model. This chapter covers common features that are extracted from time series in order to do machine learning.**

# Classifying a Timeseries
# Improving Features for Classification
# The Spectrogram

# PREDICTING TIMESERIES DATA
**If you want to predict patterns from data over time, there are special considerations to take in how you choose and construct your model. This chapter covers how to gain insights into the data before fitting your model, as well as best-practices in using predictive modeling for time series data.**

# Predicting Data Over Time
# Advanced Time Series Predictions
# Creating Features Over Time

# VALIDATING AND INSPECTING TIMESERIES MODELS
**Once you've got a model for predicting time series data, you need to decide if it's a good or a bad model. This chapter covers the basics of generating predictions with models in order to validate them against "test" data.**

# Creating Features from the Past
# Cross-Validating Time Series Data
# Stationarity & Stability
