# Exploratory Data Analysis (EDA)

## Importing and Cleaning the Dataset

This section aims to reveal trends and patterns in our data in order to gain a deeper understanding of it. The insights gained here will guide the analysis in the next section.

## Tools

I'll be using the pandas, pyplot and seaborn libraries for data manipulation and for an initial assessment of the dataset's structure.


## Approach

### Preparing Our Data

#### Analyzing and Importing our data


In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt

In [33]:
def extract_file_names(folder_path):
    # List all files in the folder
    files = os.listdir(folder_path)

    # Filter and get only files (excluding directories)
    file_names = [f for f in files if os.path.isfile(os.path.join(folder_path, f))]

    return file_names

In [None]:
# Load every file in ../data into a dictionary keyed by file name
filelist = {}
for file in extract_file_names("../data"):
    filelist[file] = pd.read_csv(os.path.join("../data", file), encoding="latin1")

In [None]:
for file in filelist:
    print(file)
    print(filelist[file].columns)

In [None]:
for file in filelist:
    print(file)
    print(filelist[file].shape)

Let's continue with some univariate analysis to gain a deeper understanding of the data.

In [None]:
for file in filelist:
    print(file)
    print(filelist[file].dtypes)

Even though most columns have the object dtype, we can see that the first row holds the units rather than data, so we remove it to get accurate types.

In [38]:
for file in filelist:
    filelist[file] = filelist[file].drop(0)
    filelist[file] = filelist[file].reset_index(drop=True)
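
As an alternative to dropping the row after loading, the units row could be skipped at read time. The sketch below assumes the units row is always the line directly after the header. The sample CSV text and the `time` column name are made up for illustration and are not taken from the dataset.

```python
import io
import pandas as pd

# Hypothetical sample: a header line, a units line, then two data rows
sample_csv = (
    "time,GHI,DNI\n"
    "yyyy-mm-dd HH:MM,W/m2,W/m2\n"
    "2021-01-01 00:01,-1.2,0.0\n"
    "2021-01-01 00:02,3.4,1.5\n"
)

# skiprows=[1] skips the second line of the file (the units row),
# so the numeric columns are parsed as numbers instead of objects
sample_df = pd.read_csv(io.StringIO(sample_csv), skiprows=[1])
print(sample_df.dtypes)
```

With this approach the later `astype(float)` step may become unnecessary, since pandas infers numeric types when no text row is mixed in.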

    

Let's convert the columns ["GHI","DNI","DHI","ModA","ModB","WS", "WSgust"] to float so they can be analyzed numerically.

In [39]:
for file in filelist:
    filelist[file][["GHI","DNI","DHI","ModA","ModB","WS", "WSgust"]] = filelist[file][["GHI","DNI","DHI","ModA","ModB","WS", "WSgust"]].astype(float)
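
Note that `astype(float)` raises an error if a column contains a value that cannot be parsed as a number. A more forgiving variant, sketched below on made-up data, uses `pd.to_numeric` with `errors="coerce"` so that unparseable entries become `NaN` instead. The toy frame and the `numeric_cols` name are assumptions for illustration.

```python
import pandas as pd

# Toy frame with string-typed readings, one of them unparseable
toy = pd.DataFrame({"GHI": ["1.5", "-2.0", "bad"], "WS": ["0.3", "0.0", "1.1"]})
numeric_cols = ["GHI", "WS"]

# Convert column by column; values that are not numbers become NaN
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(toy.dtypes)
print(toy["GHI"].isna().sum())
```

Whether silently coercing to `NaN` is preferable to failing loudly depends on how much the raw files are trusted.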

In [None]:
for file in filelist:
    print(file)
    print(filelist[file].head(5))

Before going further, let's clean the data.

Let's begin by checking for duplicated values.

In [None]:
for file in filelist:
    print(file)
    print("Count of duplicated values in {} \n".format(file))
    print(filelist[file].duplicated().sum())


In [None]:
for file in filelist:
    print(file)
    print("Count of null values in {} \n".format(file))
    print(filelist[file].isnull().sum())
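
The per-file printouts for duplicates and nulls can be hard to compare across sites. One possible way to condense them is a single summary table, sketched here on a hypothetical stand-in dictionary `frames` (the file names and values are invented for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the dictionary of loaded files
frames = {
    "site_a.csv": pd.DataFrame({"GHI": [1.0, 1.0, None], "Comments": [None, None, None]}),
    "site_b.csv": pd.DataFrame({"GHI": [2.0, 3.0, 4.0], "Comments": [None, None, None]}),
}

# One row per file: number of duplicated rows and total missing cells
summary = pd.DataFrame({
    name: {
        "duplicated_rows": int(df.duplicated().sum()),
        "null_cells": int(df.isnull().sum().sum()),
    }
    for name, df in frames.items()
}).T
print(summary)
```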


Let's start by removing the Comments column, since it contains most of the null values.

In [43]:
for file in filelist:
    filelist[file].drop(columns="Comments",inplace=True)


The first thing we need to fix is that columns like GHI, DNI and DHI should be positive, so we replace negative values with their absolute value.

In [None]:
filelist.keys()

In [45]:
for file in filelist:
    filelist[file][["GHI","DNI","DHI"]] = filelist[file][["GHI","DNI","DHI"]].abs()
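
Before overwriting the negative readings, it can help to know how many there are. The sketch below counts negative values per column on made-up data, then shows `abs()` next to `clip(lower=0)`, an alternative (not used in this notebook) that sets negatives to zero rather than flipping their sign. The toy values and the `irradiance_cols` name are assumptions.

```python
import pandas as pd

# Toy irradiance readings, some of them negative
toy = pd.DataFrame({
    "GHI": [-1.2, 0.0, 5.0],
    "DNI": [-0.3, -0.1, 2.0],
    "DHI": [0.0, 1.0, 2.0],
})
irradiance_cols = ["GHI", "DNI", "DHI"]

# Count negative readings per column
negative_counts = (toy[irradiance_cols] < 0).sum()
print(negative_counts)

# What the notebook does: flip the sign of negative readings
flipped = toy[irradiance_cols].abs()

# Alternative assumption: treat negative readings as zero
clipped = toy[irradiance_cols].clip(lower=0)
```

The two options give different results for the same input, so the choice is worth stating explicitly alongside the cleaned data.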


Next, let's examine how much of the data consists of outliers.

In [59]:
for file in filelist:
    print(file)
    for column in ["ModA","ModB","WS", "WSgust"]:
        print(column)
        filelist[file][column] = filelist[file][column].abs()
        # Calculate Q1 (25th percentile) and Q3 (75th percentile)
        Q1 = filelist[file][column].quantile(0.25)
        Q3 = filelist[file][column].quantile(0.75)

        # Calculate IQR
        IQR = Q3 - Q1

        # Define the bounds for outliers
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Identify outliers
        outliers = filelist[file][column][(filelist[file][column] < lower_bound) | (filelist[file][column] > upper_bound)]

        print("Number of Outliers:",outliers.count())


solar-measurements_benin-malanville_qc.csv
ModA
Number of Outliers: 98
ModB
Number of Outliers: 240
WS
Number of Outliers: 6717
WSgust
Number of Outliers: 5368
solar-measurements_benin-parakou_qc.csv
ModA
Number of Outliers: 9688
ModB
Number of Outliers: 9423
WS
Number of Outliers: 3578
WSgust
Number of Outliers: 3506
solar-measurements_sierraleone-bumbuna_qc.csv
ModA
Number of Outliers: 21017
ModB
Number of Outliers: 20613
WS
Number of Outliers: 3169
WSgust
Number of Outliers: 3665
solar-measurements_sierraleone-kenema_qc.csv
ModA
Number of Outliers: 32869
ModB
Number of Outliers: 32228
WS
Number of Outliers: 7603
WSgust
Number of Outliers: 3778
solar-measurements_togo-dapaong_qc.csv
ModA
Number of Outliers: 1237
ModB
Number of Outliers: 1537
WS
Number of Outliers: 8708
WSgust
Number of Outliers: 7377
solar-measurements_togo-davie_qc.csv
ModA
Number of Outliers: 32688
ModB
Number of Outliers: 31063
WS
Number of Outliers: 3348
WSgust
Number of Outliers: 3986
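
The IQR rule used in the loop above can be factored into a small helper, which also makes it easy to report the share of outliers rather than only the count. This is a sketch on made-up data; `iqr_outlier_mask` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy series with one obviously extreme value
toy = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 4.0, 5.0, 100.0])
mask = iqr_outlier_mask(toy)
print("Number of outliers:", int(mask.sum()))
print("Share of outliers: {:.1%}".format(mask.mean()))
```

Reporting the share alongside the count makes sites with different numbers of rows easier to compare.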
