
# Data Analysis Project

**Student Names:** Auriana Mitchell, Scarlett Shi <br>
**Course Name:**  MSBA-601-01 Fund Tech for Bus Analytics (Fall 2024) <br>
**Date:** 11/20/2024<br>

---

### Project Overview

In this project, you will work in a team of two and utilize what we have learned in this class to perform an analysis on a dataset.

The data you choose is up to you! I encourage you to select a dataset that fits your area of expertise. Some resources for datasets are listed below. These are just resources I've used in the past; you are not limited to them, but they are a place to start.

- [Kaggle](https://www.kaggle.com/datasets)
- [data.gov](https://data.gov/)
- [Google Dataset Search](https://datasetsearch.research.google.com/)
- [SEC datasets](https://www.sec.gov/data-research/sec-markets-data/financial-statement-data-sets)
- [Dataquest](https://www.dataquest.io/blog/free-datasets-for-projects/)

Your dataset must have at LEAST 10,000 rows of data. Additionally, it should have enough attributes to allow for a meaningful analysis and level of complexity (e.g., a dataset with just two columns for "name" and "birthdate" likely is not going to be enough.)

Once you have selected a dataset of at least 10,000 rows, you will be required to perform an analysis of it. The goal is to:
- Import and explore the dataset.
- Clean and prepare the data for analysis.
- Derive at least 6 key insights from the data using Python and Python libraries. You are welcome to add more coding blocks and more insights. 
- Communicate these insights clearly and professionally, and explain how they can inform decision-making.

---


### Data Overview

The dataset contains 36,733 instances of 11 sensor measurements, each aggregated over one hour, from a gas turbine located in Turkey. It was collected to study flue gas emissions, namely CO and NOx.

What is a gas turbine?

A gas turbine is a type of combustion engine that converts the energy from burning fuel (usually natural gas, but sometimes oil or other fuels) into mechanical energy, which can be used to drive a generator or other mechanical equipment. Gas turbines are widely used for electricity generation, propulsion in aircraft, and in various industrial applications.

See the gas turbine image at [energyeducation.ca](https://energyeducation.ca/encyclopedia/Gas_turbine).

Data Variables:

- **Ambient temperature (AT):** Measures the temperature of the air surrounding the system. It's important for understanding environmental conditions and their effects on system performance.
- **Ambient pressure (AP):** Measures the atmospheric pressure of the surrounding environment. This affects air density and, consequently, the efficiency of processes like combustion and air compression.
- **Ambient humidity (AH):** Measures the relative moisture content in the surrounding air, expressed as a percentage. High humidity can affect combustion efficiency and system cooling.
- **Air filter difference pressure (AFDP):** Measures the pressure drop across the air filter. This indicates how clean or clogged the filter is; higher values suggest restricted airflow due to dirt buildup.
- **Gas turbine exhaust pressure (GTEP):** Measures the pressure of gases exiting the gas turbine. It reflects the backpressure on the turbine and can influence its efficiency and power output.
- **Turbine inlet temperature (TIT):** Measures the temperature of gases entering the turbine. Higher temperatures typically allow for more energy generation but require careful control to avoid material stress and damage.
- **Turbine after temperature (TAT):** Measures the temperature of gases after they pass through the turbine. It indicates how much energy was extracted from the gases during the process.
- **Compressor discharge pressure (CDP):** Measures the pressure of air exiting the compressor. This is a key parameter in evaluating the compressor’s performance and the efficiency of the overall cycle.
- **Turbine energy yield (TEY):** Measures the total energy output of the turbine, typically used to assess its efficiency and production capacity over a specific period.
- **Carbon monoxide (CO):** Measures the concentration of carbon monoxide in the exhaust gases. This is a pollutant and an indicator of incomplete combustion in the system.
- **Nitrogen oxides (NOx):** Measures the concentration of nitrogen oxides in the exhaust gases. These are harmful pollutants formed at high temperatures during combustion and are closely monitored for environmental compliance.

## 1. Importing Necessary Libraries

In [1]:
import pandas as pd


## 2. Loading the Dataset

Load your dataset here. Make sure the dataset has at least 10,000 rows.


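In [1]:
# pandas must be imported in the running kernel before pd.read_csv is called;
# importing it here avoids "NameError: name 'pd' is not defined".
import pandas as pd

# Load one CSV file per year
data_2011 = pd.read_csv('gt_2011.csv')
data_2012 = pd.read_csv('gt_2012.csv')
data_2013 = pd.read_csv('gt_2013.csv')
data_2014 = pd.read_csv('gt_2014.csv')
data_2015 = pd.read_csv('gt_2015.csv')

# Preview the first rows of the 2011 data
print(data_2011.head())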


## 3. Data Exploration

Explore the dataset by checking its dimensions, data types, and any missing values.


In [2]:
import pandas as pd

# Read the five yearly files and stack them into a single DataFrame
file_names = ['gt_2011.csv','gt_2012.csv','gt_2013.csv','gt_2014.csv','gt_2015.csv']
combinned_data = pd.concat([pd.read_csv(file) for file in file_names], ignore_index=True)
print(combinned_data.head())

# Save the combined data so later steps can reuse it
combinned_data.to_csv('combined_gt_2011_2015.csv', index=False)
print("File combined successfully!")



       AT      AP      AH    AFDP    GTEP     TIT     TAT     TEY     CDP  \
0  4.5878  1018.7  83.675  3.5758  23.979  1086.2  549.83  134.67  11.898   
1  4.2932  1018.3  84.235  3.5709  23.951  1086.1  550.05  134.67  11.892   
2  3.9045  1018.4  84.858  3.5828  23.990  1086.5  550.19  135.10  12.042   
3  3.7436  1018.3  85.434  3.5808  23.911  1086.5  550.17  135.03  11.990   
4  3.7516  1017.8  85.182  3.5781  23.917  1085.9  550.00  134.67  11.910   

        CO     NOX  
0  0.32663  81.952  
1  0.44784  82.377  
2  0.45144  83.776  
3  0.23107  82.505  
4  0.26747  82.028  
File combined successfully!



**Guiding Questions:**
- What columns are available in the data?
- Are there any columns with missing values?
- What data types are present?
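
The guiding questions above can be answered with a few pandas calls. The following is a minimal sketch, not the required approach: `explore` is a hypothetical helper introduced here, and `sample` is a stand-in for `combinned_data` that holds only four columns of the five rows printed above.

```python
import pandas as pd

# Stand-in for combinned_data: four columns of the five rows printed above.
sample = pd.DataFrame({
    "AT": [4.5878, 4.2932, 3.9045, 3.7436, 3.7516],
    "TEY": [134.67, 134.67, 135.10, 135.03, 134.67],
    "CO": [0.32663, 0.44784, 0.45144, 0.23107, 0.26747],
    "NOX": [81.952, 82.377, 83.776, 82.505, 82.028],
})


def explore(df, min_rows=10_000):
    """Collect the facts the guiding questions ask about (hypothetical helper)."""
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "columns": list(df.columns),
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isnull().sum().to_dict(),
        # The assignment requires at least 10,000 rows.
        "meets_row_minimum": len(df) >= min_rows,
    }


summary = explore(sample)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Calling `explore(combinned_data)` in the notebook would report the same items for the full dataset; the five-row stand-in naturally fails the row-minimum check.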



## 4. Data Cleaning

Clean the data to ensure consistency and handle any missing values or outliers.


In [3]:
print(combinned_data.isnull().sum())

AT      0
AP      0
AH      0
AFDP    0
GTEP    0
TIT     0
TAT     0
TEY     0
CDP     0
CO      0
NOX     0
dtype: int64


In [6]:
import pandas as pd

# Add a Year column to each yearly file and save an updated copy
for year in range(2011, 2016):
    df = pd.read_csv(f'gt_{year}.csv')
    df['Year'] = year
    df.to_csv(f'update_gt{year}.csv', index=False)



In [7]:
import pandas as pd

# Recombine the yearly files, which now carry a Year column
file_names = ['update_gt2011.csv', 'update_gt2012.csv', 'update_gt2013.csv','update_gt2014.csv','update_gt2015.csv']
combinned_data = pd.concat([pd.read_csv(file) for file in file_names], ignore_index=True)
print(combinned_data.head())

# Overwrite the combined file with the version that includes Year
combinned_data.to_csv('combined_gt_2011_2015.csv', index=False)
print("File combined successfully!")

       AT      AP      AH    AFDP    GTEP     TIT     TAT     TEY     CDP  \
0  4.5878  1018.7  83.675  3.5758  23.979  1086.2  549.83  134.67  11.898   
1  4.2932  1018.3  84.235  3.5709  23.951  1086.1  550.05  134.67  11.892   
2  3.9045  1018.4  84.858  3.5828  23.990  1086.5  550.19  135.10  12.042   
3  3.7436  1018.3  85.434  3.5808  23.911  1086.5  550.17  135.03  11.990   
4  3.7516  1017.8  85.182  3.5781  23.917  1085.9  550.00  134.67  11.910   

        CO     NOX  Year  
0  0.32663  81.952  2011  
1  0.44784  82.377  2011  
2  0.45144  83.776  2011  
3  0.23107  82.505  2011  
4  0.26747  82.028  2011  
File combined successfully!



**Instructions:**
- Describe each cleaning step you take.
- Explain how these changes will improve the analysis.
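
Beyond the missing-value check, the section also mentions consistency and outliers. One common way to look at both is to count duplicated rows and flag values outside 1.5 times the interquartile range. This is a sketch under assumptions: the IQR rule and the factor `k=1.5` are choices made here, `iqr_outlier_mask` is a hypothetical helper, and `sample` holds only two columns of the five rows printed above, which is far too few to judge real outliers.

```python
import pandas as pd

# Stand-in for combinned_data: two columns of the five rows printed above.
sample = pd.DataFrame({
    "CO": [0.32663, 0.44784, 0.45144, 0.23107, 0.26747],
    "NOX": [81.952, 82.377, 83.776, 82.505, 82.028],
})


def iqr_outlier_mask(series, k=1.5):
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Fully duplicated rows (identical in every column)
n_duplicates = int(sample.duplicated().sum())

# Number of flagged values per column
outlier_counts = {
    col: int(iqr_outlier_mask(sample[col]).sum()) for col in sample.columns
}

print("Duplicated rows:", n_duplicates)
print("Flagged values per column:", outlier_counts)
```

Flagged readings are candidates for inspection rather than automatic removal; for sensor data, an extreme value may reflect a real operating condition.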



## 5. Data Visualization

Use visualization to better understand the data and spot trends. Include at least two visualizations.


In [4]:
import matplotlib.pyplot as plt


In [None]:
# Code block for visualization 2


**Guiding Questions:**
- What patterns do you see?
- How can these patterns contribute to further insights?
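
One possible pair of plots is a histogram of one emission measure and a scatter plot of another against an operating variable. The sketch below is illustrative only: the choice of columns (`CO`, `TIT`, `NOX`) is an assumption made here, and `sample` holds just the five rows printed earlier in place of the full `combinned_data`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for combinned_data: three columns of the five rows printed earlier.
sample = pd.DataFrame({
    "TIT": [1086.2, 1086.1, 1086.5, 1086.5, 1085.9],
    "CO": [0.32663, 0.44784, 0.45144, 0.23107, 0.26747],
    "NOX": [81.952, 82.377, 83.776, 82.505, 82.028],
})

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Visualization 1: how CO readings are distributed
ax_hist.hist(sample["CO"], bins=5)
ax_hist.set_xlabel("CO")
ax_hist.set_ylabel("Number of hourly records")
ax_hist.set_title("Distribution of CO")

# Visualization 2: NOX against turbine inlet temperature
ax_scatter.scatter(sample["TIT"], sample["NOX"], s=12)
ax_scatter.set_xlabel("TIT")
ax_scatter.set_ylabel("NOX")
ax_scatter.set_title("NOX vs. turbine inlet temperature")

fig.tight_layout()
# In Jupyter the figure renders when the cell finishes; in a script, call plt.show().
```

With the full dataset, a larger `bins` value and a lower marker opacity (the `alpha` argument of `scatter`) would likely be easier to read.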



## 6. Deriving Insights

In this section, derive six insights from the data. Each insight should have:
- A brief description of what you are examining.
- The code used to calculate or visualize it.
- A short analysis of the results and how they contribute to understanding the data.
- Visuals can be used to aid your insights as well.
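
As a sketch of what the code for one insight might look like, the example below computes yearly mean emissions and ranks the other variables by the strength of their correlation with `CO`. The questions asked here are assumptions chosen for illustration, not the project's findings, and `sample` contains only the five 2011 rows printed earlier, so its numbers say nothing about the full dataset.

```python
import pandas as pd

# Stand-in for combinned_data: selected columns of the five rows printed earlier.
sample = pd.DataFrame({
    "AT": [4.5878, 4.2932, 3.9045, 3.7436, 3.7516],
    "TIT": [1086.2, 1086.1, 1086.5, 1086.5, 1085.9],
    "TEY": [134.67, 134.67, 135.10, 135.03, 134.67],
    "CO": [0.32663, 0.44784, 0.45144, 0.23107, 0.26747],
    "NOX": [81.952, 82.377, 83.776, 82.505, 82.028],
    "Year": [2011, 2011, 2011, 2011, 2011],
})

# Mean CO and NOX per year (one row per year present in the data)
yearly_means = sample.groupby("Year")[["CO", "NOX"]].mean()

# Pearson correlation of every other measure with CO, strongest first
measures = sample.drop(columns="Year")
co_corr = (
    measures.corr()["CO"]
    .drop("CO")
    .sort_values(key=abs, ascending=False)
)

print(yearly_means)
print(co_corr)
```

Correlation describes association only; the analysis text for an insight still needs to explain what the relationship means for how the turbine is operated.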


### Insight 1

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


In [None]:
# Code for Insight


### Insight 2

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


In [None]:
# Code for Insight


### Insight 3

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>

In [None]:
# Code for Insight


### Insight 4

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>

In [None]:
# Code for Insight


### Insight 5

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>

In [None]:
# Code for Insight 


### Insight 6

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


In [1]:
# Code for Insight


## 7. Summary and Conclusions

Summarize your findings and highlight the most impactful insights you derived from the data.

**Guiding Questions:**
- What key takeaways did you learn from this data?
- How can these insights inform future decisions?
- Are there any limitations in your analysis?


**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


## 8. Projections

Finally, tell us something that can be learned from this analysis to make better decisions going forward. 

**Insight Description:** _[Describe what you are trying to learn from this analysis.]_ <br>
**Analysis:** _[Describe what this insight tells you and how it can inform decisions.]_<br>


**Instructions for Submission:**

- Submit the completed `.ipynb` file, dataset, and link to dataset in Canvas.


## Grading Criteria

Below are some of the criteria that I will be using when grading the final case. Keep in mind, no one part has a specific weight.

| **Section**          | **Description**                                                                                                                                             |
|----------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Dataset Quality**  | Does the dataset meet the length and complexity requirements?                                                                                               |
| **Data Ingestion**   | Demonstrates a well-structured process using Python to import, explore, and clean the data.                                                                 |
| **Data Cleanup**     | Clearly explains how data was cleaned (if necessary) to prepare it for analysis.                                                                            |
| **Analysis**         | - Utilizes a variety of modules and functions to analyze the cleaned dataset.                                                                               |
|                      | - You and your partner must provide at least six (6) insights derived from the analysis.                                                                    |
|                      | - Insights should reveal something not initially obvious from the raw data.                                                                                 |
|                      | - Present each insight with code execution, printed data tables, visuals, etc., and explain the process and significance.                                   |
| **Conclusion**       | Synthesizes your insights to explain the overall story the data tells.                                                                                      |
|                      | Discusses the importance of the data and general conclusions that can be drawn.                                                                             |
| **Projection**       | Takes the analysis further by considering its application in a real-world context.                                                                          |
|                      | Provides recommendations for stakeholders on how the findings can inform better, more informed decisions in the future.                                     |
| **Overall Presentation Quality** | How descriptive the explanations in the file are, and the quality of the verbal presentation. |
