# Exploratory Data Analysis

One of the most important steps in any analysis of a large dataset is
validating the data. A simple and effective way to do this is Exploratory Data
Analysis (EDA). EDA covers a wide range of checks, from examining the dataset
for missing data points to gathering statistical measures such as the mean,
standard deviation, and variance. This can be done manually, but Python packages
like `summarytools` and `ydata_profiling` streamline the process and provide
on-demand EDA. Using more than one tool helps verify the data further, because
their outputs can be cross-referenced against each other. In the cells below,
you can explore this fast and easy process yourself in a Jupyter Notebook.
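
To make the "manual" route concrete, here is a minimal sketch of the checks mentioned above using plain `pandas`. The data frame and its column names (`wind_speed`, `power_output`) are hypothetical stand-ins, not the notebook's real data.

```python
import pandas as pd

# Hypothetical stand-in data; the real notebook loads wind.csv and solar.csv.
sample = pd.DataFrame(
    {
        "wind_speed": [5.0, 7.0, None, 9.0],            # hypothetical column
        "power_output": [100.0, 140.0, 160.0, 200.0],   # hypothetical column
    }
)

# Count the missing values in each column.
missing_counts = sample.isna().sum()

# Mean, standard deviation, and variance per column (missing values are skipped).
summary = sample.agg(["mean", "std", "var"])

print(missing_counts)
print(summary)
```

This works for a handful of columns, but it quickly becomes repetitive, which is where the dedicated EDA packages help.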

## Installing Dependencies and Importing Libraries

Before performing any analysis, we need to obtain the necessary dependencies and
libraries. `matplotlib` is a tricky one: other notebooks need different versions
for different processes, so we install it manually here. Each cell can be run
with `Shift+Enter` or by clicking the "play" symbol in the toolbar.

In [None]:
! pip install matplotlib=="3.6.0"
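
Because the version matters here, it can be worth confirming which `matplotlib` the kernel actually sees. The sketch below uses only the standard library; the helper name `installed_version` is our own and not part of any package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


expected = "3.6.0"  # the version pinned in the install cell
found = installed_version("matplotlib")

if found == expected:
    print(f"matplotlib {found} is installed, as expected.")
else:
    # found is None when matplotlib is not installed at all.
    print(f"Found matplotlib {found}, expected {expected}.")
```

If the versions differ after installing, restarting the kernel and re-running the cell is usually the first thing to try.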

Notice that we are also importing `pandas`, another very useful library that
helps manage our data by storing it in data frames.

In [2]:
from summarytools import dfSummary
from ydata_profiling import ProfileReport
import pandas as pd

## Performing EDA with Summarytools

`summarytools` is a package that allows for very fast but simple EDA. It
generates a report very quickly, at the cost of less interactivity and
robustness. This does not make it useless, however: it still provides
valuable insights into properties of statistical distributions, as well
as some simple visualizations to help put the data into perspective.

The first step is to read the data into a `pandas` data frame.
Here, we make two: one for the wind data and another for the solar data.

In [4]:
wind_farm_data = pd.read_csv('../data/wind.csv')
solar_farm_data = pd.read_csv('../data/solar.csv')
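
Before handing a data frame to an EDA tool, a quick look at its shape and column types can catch a bad load early. The sketch below reads from an in-memory string so it runs anywhere; the CSV contents and column names are hypothetical and do not reflect the real `wind.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for ../data/wind.csv.
csv_text = """timestamp,wind_speed,power_output
2020-01-01 00:00,5.2,110.0
2020-01-01 01:00,6.8,
2020-01-01 02:00,7.1,150.5
"""

df = pd.read_csv(io.StringIO(csv_text))

# How many rows and columns were read?
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Which type did pandas infer for each column?
print(df.dtypes)

# Peek at the first few rows.
print(df.head())
```

The same three checks (`shape`, `dtypes`, `head()`) can be applied to `wind_farm_data` and `solar_farm_data` directly.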

Now that the data is loaded, the wind farm data can be analyzed
using `summarytools`.

In [None]:
dfSummary(wind_farm_data)

The same goes for the solar farm data.

In [None]:
dfSummary(solar_farm_data)

## Performing EDA with YData Profiling

`YData Profiling` is roughly the opposite of `summarytools`: it generates a
very robust report with many visualizations, at the cost of slower
generation. This trade-off is acceptable in this case, however, because
it's the EDA that matters most.
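
If generation time does become a problem on larger data, `ProfileReport` accepts a `minimal=True` argument that turns off the more expensive computations. The sketch below wraps that in a small helper; the function name `build_profile` and the output file name are our own choices, and the usage lines are commented out because they need the data frames loaded earlier.

```python
import pandas as pd


def build_profile(df: pd.DataFrame, title: str, minimal: bool = True):
    """Build a ProfileReport; minimal=True skips the more expensive computations."""
    # Imported lazily because ydata_profiling is a heavy dependency.
    from ydata_profiling import ProfileReport

    return ProfileReport(df, title=title, minimal=minimal)


# Example usage in this notebook (assumes wind_farm_data is already loaded):
# quick_wind_profile = build_profile(wind_farm_data, "Wind Farm Data Report (minimal)")
# quick_wind_profile.to_file("wind_report_minimal.html")  # hypothetical file name
```

A minimal report is less detailed, so it is best treated as a quick first pass rather than a replacement for the full report.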

Here, we are using the `pandas` data frames from earlier and generating
a profile report for both the wind data and the solar data.

In [7]:
wind_profile = ProfileReport(wind_farm_data, title="Wind Farm Data Report")
solar_profile = ProfileReport(solar_farm_data, title="Solar Farm Data Report")

Now that the wind profile is created, we can display it in an interactive widget
using `ydata_profiling`'s notebook support. It allows for creating and
tweaking graphs and tables. Give it a try and explore the data before moving on.

In [None]:
wind_profile.to_notebook_iframe()

The same goes for the solar data.

In [None]:
solar_profile.to_notebook_iframe()
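
With both reports available, their figures can be cross-referenced against each other, or against an independent calculation. As one possible spot check, the sketch below compares the mean and sample standard deviation from `pandas` with the same values from Python's `statistics` module. The helper name `cross_check` and the sample column are hypothetical.

```python
import math
import statistics

import pandas as pd


def cross_check(series: pd.Series, rel_tol: float = 1e-9) -> bool:
    """Return True if pandas and the statistics module agree on mean and sample std.

    Needs at least two non-missing values, since a sample standard
    deviation is undefined for fewer.
    """
    values = series.dropna().tolist()
    mean_ok = math.isclose(series.mean(), statistics.fmean(values), rel_tol=rel_tol)
    std_ok = math.isclose(series.std(), statistics.stdev(values), rel_tol=rel_tol)
    return mean_ok and std_ok


# Hypothetical column; in the notebook this could be any numeric column
# of wind_farm_data or solar_farm_data.
sample = pd.Series([5.0, 7.0, None, 9.0], name="wind_speed")
print(cross_check(sample))
```

The same idea applies to the reports: pick a column, read its mean from each report, and confirm that the numbers match.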

## Next Steps

Now that all of the data has been explored and verified, it's time to start
using it. To do this, we will use the `scikit-learn` Python package to
perform some modeling and predictions. In [k_means.ipynb](./k_means.ipynb), we
look at how we can model some attributes of our data sets using clusters and
make predictions based on that model.