# **ICE: Exploratory Data Analysis (EDA) through Descriptive Statistics**

## Name:

## *DATA 3300*

In this assignment on exploratory data analysis (EDA) through descriptive statistics, we will start by examining the normal distribution using histograms and comparing it to non-normal distributions. We'll then cover descriptive statistics, including measures of central tendency, shape, and spread. Finally, we'll form hypotheses based on these descriptive statistics to compare different groups and draw meaningful insights from the data.




## **Company Scenario**

> GridX, a Greentech firm, focuses on large-scale decarbonization by providing analytics on clean energy options to utility and energy companies. They create models for various transitions to clean energy and offer insights and advice on pricing. This helps companies understand the financial impact of adopting greener technologies. GridX's software suite includes programs for billing analytics, customer service, and bundling clean energy options. Currently, GridX operates only in the US but is looking to expand to other countries.

> Your task is to assist GridX in identifying which countries' utility and energy companies they should prioritize for marketing (beyond the US). Use descriptive statistics to explore the dataset, identify patterns, and develop hypotheses that can inform recommendations for GridX's sales team.


**Dataset variables**
* Country: Name of the country
* Region: Geographic region to which the country belongs
* SDGi (Sustainable Development Goals Index): The SDG Index evaluates countries based on indicators such as poverty, education, health, and environmental sustainability, providing a comprehensive overview of their efforts and achievements in sustainable development
* Life Expectancy: Average life expectancy in years
* HDI (Human Development Index): A composite index measuring life expectancy, education, and per capita income
* Per Capita GDP: Gross Domestic Product (GDP) per capita
* Income Group: Income group classification (e.g., LI for Low Income, UM for Upper Middle Income)
* Population: Country population
* Carbon Footprint: Ecological footprint of carbon emissions, measured in global hectares per person
* Total Ecological Footprint (Consumption): Total ecological footprint of consumption, measured in global hectares per person
* Ecological (Deficit) or Reserve: Total biocapacity minus total ecological footprint, indicating whether the country has an ecological deficit (negative value) or reserve (positive value)
* Number of Earths Required: The number of Earths that would be required if everyone lived like the average person in the country
* Total Biocapacity: Total biocapacity, measured in global hectares per person



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats

In [None]:
df = pd.read_csv('', encoding = 'latin-1')
df.head()

**1. If our main goal is to identify countries that GridX should prioritize, which variables should we focus on? Why?**

* Var 1
* Var 2
* Add as needed


**2. Take a minute to come up with a question (or two) you'd like to investigate using the variables of interest listed above, in the context of identifying countries that GridX should prioritize.**



1.   
2.   



**We can begin to explore these types of questions by performing EDA through descriptive statistics. But before we know which descriptive stats to prioritize, it can be helpful to examine the distribution of a variable to know whether our observations follow a normal distribution (or not).**
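
As a rough illustration of what "normal" versus "non-normal" can look like numerically, the sketch below compares the skewness of two synthetic samples. The data are randomly generated for illustration only (they are not the GridX dataset), and the variable names are assumptions.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# Synthetic samples for illustration only (not the GridX data)
symmetric = rng.normal(loc=50, scale=10, size=5000)   # roughly bell-shaped
right_skewed = rng.exponential(scale=10, size=5000)   # long right tail

# Skewness near 0 suggests symmetry; a clearly positive value suggests a right tail
skew_symmetric = stats.skew(symmetric)
skew_right = stats.skew(right_skewed)

print(f"symmetric sample skew:    {skew_symmetric:.2f}")
print(f"right-skewed sample skew: {skew_right:.2f}")
```

A histogram remains the more direct check; a skewness value is only a one-number summary of the same shape.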

### Begin by checking which variables can be evaluated

In [None]:
# check info and vars

### Convert variables that should be floats!

In [None]:
# convert as necessary
for col in []:
    df[col] = pd.to_numeric(df[col], errors='coerce') # add comment

# Print column types and non-null counts, then preview the first rows
df.info()
df.head()
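
To see what `errors='coerce'` does in isolation, here is a minimal sketch on a small, hypothetical Series. The raw values are made up for illustration; the real columns may be formatted differently.

```python
import pandas as pd

# Hypothetical raw values; the actual dataset may look different
raw = pd.Series(["71.2", "n/a", "65", "$1,200"])

# Entries that cannot be parsed as numbers become NaN instead of raising an error
converted = pd.to_numeric(raw, errors="coerce")

print(converted)
print("values that failed to parse:", converted.isna().sum())
```

Note that a currency-style string such as `"$1,200"` is also turned into NaN here, so columns formatted that way would likely need their symbols and separators stripped before conversion.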

### Visualize the Distributions

In [None]:
df2 = # Remove observations with NaN values
df2.hist(layout=(), figsize=(), bins=) #generate histograms of remaining numerical vars
plt.show()

**3. How would we describe the shape of the following distributions?**

* SDGi:
* HDI:
* Carbon Footprint:
* Ecological Deficit:

**4. What implications does the shape of these distributions have for interpreting descriptive statistics? Which measure of central tendency should NOT be relied on?**
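
One way to explore this question is to compare measures of central tendency on a small sample that contains one extreme observation. The values below are hypothetical and chosen only to make the effect easy to see.

```python
import numpy as np

# Hypothetical per-person footprint values for seven countries, one of them extreme
values = np.array([1.0, 1.2, 1.5, 1.8, 2.0, 2.3, 15.0])

mean_value = values.mean()        # sensitive to the extreme observation
median_value = np.median(values)  # depends only on the middle of the ordering

print(f"mean:   {mean_value:.2f}")
print(f"median: {median_value:.2f}")
```

Comparing the two numbers against the bulk of the values gives a sense of how each measure responds to a long tail.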

## Descriptive Statistics

**Next, let's start examining some descriptive statistics. We'll begin by grouping our descriptive stats by Region, then we can also drill down to specific countries!**

**5. Return to your posed question(s) above. What information can you gather from this descriptive statistics table to address your question(s)?**

In [None]:
pd.set_option('display.max_columns', None) # display all columns in dataframe
df2.groupby('Region')[''].().sort_values()

**6. Which region has the most spread in Carbon Footprint? What measure(s) indicate this?**

In [None]:
df2.groupby('Region')[''].().sort_values(ascending=False)
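
As a sketch of how several spread measures can be compared side by side, the example below computes the standard deviation, interquartile range, and range per group on a toy DataFrame. The region labels and values are hypothetical, not taken from the GridX dataset.

```python
import pandas as pd

# Hypothetical toy data, not the GridX dataset
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Carbon Footprint": [1.0, 2.0, 3.0, 4.0, 2.0, 2.5, 3.0, 3.5],
})

grouped = toy.groupby("Region")["Carbon Footprint"]

# Three common measures of spread, one row per region
spread = pd.DataFrame({
    "std": grouped.std(),
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
    "range": grouped.max() - grouped.min(),
})

print(spread.sort_values("std", ascending=False))
```

Looking at more than one measure helps, since the range reflects only the two most extreme values while the IQR ignores them.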

**7. Take a few minutes to explore the full descriptive stats table grouped by Region. Note any interesting pattern in your descriptive statistics table, and create a hypothesis as to the cause of this finding.**

In [None]:
df2.groupby('Region')

### Let's Examine the Descriptive Stats of Two Regions we'd like to Compare...

**8. Which region would you recommend targeting, and why? Which descriptive statistics and variables is your recommendation based on?**

In [None]:
# compare two regions of interest
df2.groupby('Region')[['','']].describe().loc[['', '']]


### Next, let's drill down into one specific region and identify the countries that we might target...

In [None]:
df3 = df2[df2['Region'] == '']
df3.head()

In [None]:
numeric_cols = df3.select_dtypes(include=np.number).columns
num_cols = len(numeric_cols)
num_rows = (num_cols + 2) // 3  # Add comments

plt.figure(figsize=(15, 5 * num_rows))  # Add comments

for i, col in enumerate(numeric_cols):
    plt.subplot(num_rows, 3, i + 1)  # Add comments
    sns.boxplot(y=df3[col])
    plt.title(col)

plt.tight_layout()  # Add comments
plt.show()
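
Points drawn beyond the whiskers of a boxplot are typically those falling more than 1.5 × IQR outside the quartiles, which is also seaborn's default whisker setting (`whis=1.5`). The sketch below applies that rule by hand to a hypothetical Series, so the flagged values can be listed instead of read off a plot.

```python
import pandas as pd

# Hypothetical values for illustration, not taken from the dataset
values = pd.Series([2.1, 2.4, 2.6, 2.9, 3.0, 3.3, 9.5], name="hypothetical_footprint")

# Quartiles and interquartile range
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Observations outside the fences would be drawn as individual points
flagged = values[(values < lower) | (values > upper)]
print(flagged)
```

The same filter could be applied to a column of `df3` to retrieve the country names behind any outlying points.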


**9. What observations can we make from the distributions for this region's countries that might help us narrow down the countries of interest?**




**10. Visualize a relationship between two variables that tests your hypothesis, addresses your originally posed questions, or follows up on the observations above.**

In [None]:
plt.figure(figsize=(12, 10))

# Use a colormap for the scatter plot
scatter = plt.scatter(df3[''], df3[''],
                      c=df3[''], cmap='viridis', s=100)

# Add labels for each point
for i, row in df3.iterrows():
    plt.text(row[''], row[''], row[''], fontsize=9, ha='right')

# Add colorbar
colorbar = plt.colorbar(scatter)
colorbar.set_label('colorbar label')

plt.title('')
plt.xlabel('')
plt.ylabel('')
plt.grid(True)
plt.show()


**11. Based on your findings, which countries would you recommend that GridX's sales team focus on, and why?**