
# Data Science Assignment: Exploring Documentation and Data Analysis Tools

In this assignment, you will explore three essential tools for data analysis in Python:
- **Pandas**
- **Matplotlib**
- One more Python data science package of your choice (e.g., `seaborn`, `scikit-learn`, `numpy`, etc.)

## Objectives:
1. **Research**: You will read the documentation for these libraries and explore some new or interesting features.
2. **Implementation**: You will demonstrate 5 different techniques or functionalities you have learned by analyzing a dataset of your choice.

### Steps:
1. **Explore the Pandas Documentation**: Visit the [Pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/). Look for interesting or new features that you have not yet explored. Examples could include handling missing data, merging data, or using advanced group-by functions.

2. **Explore the Matplotlib Documentation**: Visit the [Matplotlib Documentation](https://matplotlib.org/stable/contents.html). Look for advanced visualization techniques that will help you visualize your data.

3. **Select Another Library**: Choose another data science library you are interested in, such as `seaborn`, `scikit-learn`, or `numpy`. Visit its documentation, learn something new, and apply it to your dataset.

### Requirements:
- **Markdown Documentation**: For each technique or method you implement, explain what it does and why you chose it in a markdown cell before each code block.
- **Code Implementation**: After the markdown explanation, provide a code block where you implement the feature with your dataset.

---

### Example Dataset
You are free to use any dataset of your choice. If you don't have one, you can download popular datasets from websites like [Kaggle](https://www.kaggle.com/datasets) or use the built-in datasets in Python libraries like [seaborn](https://www.geeksforgeeks.org/seaborn-datasets-for-data-science/) or [sklearn](https://scikit-learn.org/1.5/datasets/real_world.html).



## Task 1: Pandas Documentation

Research and choose two interesting functions or techniques from the Pandas documentation that are useful for data analysis. You may explore new ways to handle missing data, merge datasets, or advanced grouping techniques.

### Example:
- **Pandas Method**: `pd.merge()` - Use this function to merge two dataframes.
- **Why I chose this**: Merging data from different sources is a common task in data analysis, and this function is highly efficient.
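
As a minimal sketch of the `pd.merge()` example above, the following joins two small tables on a shared key. The table contents and the column names (`dinosaur_type`, `diet`, `weight_in_tons`) are invented for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical data: names and values are illustrative only.
diets = pd.DataFrame({
    "dinosaur_type": ["Tyrannosaurus", "Triceratops", "Stegosaurus"],
    "diet": ["carnivore", "herbivore", "herbivore"],
})
weights = pd.DataFrame({
    "dinosaur_type": ["Tyrannosaurus", "Triceratops", "Brachiosaurus"],
    "weight_in_tons": [8.0, 9.0, 50.0],
})

# An inner join keeps only the keys that appear in both tables.
merged = pd.merge(diets, weights, on="dinosaur_type", how="inner")
print(merged)
```

Passing `how="left"` or `how="outer"` instead would keep the unmatched rows and fill the missing columns with `NaN`.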



In [6]:

# Task 1: Pandas Method 1
# Code goes here: Implement a pandas method you researched from the documentation.
import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences.
Dinosaur_Dataframe = pd.read_csv(r"C:\MSBA 601-C Drive\MACC 601 Storage.csv")
# Sort from heaviest to lightest.
Dinosaur_Dataframe_sorted = Dinosaur_Dataframe.sort_values(by="weight_in_tons", ascending=False)
print(Dinosaur_Dataframe_sorted)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\MSBA 601-C Drive\\MACC 601 Storage.csv'

In [7]:

# Task 1: Pandas Method 2
# Code goes here: Implement another pandas method you researched from the documentation.

import pandas as pd

# Raw string so the backslashes in the Windows path are not read as escape sequences.
Dinosaur_Dataframe = pd.read_csv(r"C:\MSBA 601-C Drive\MACC 601 Storage.csv")
# Keep only the first row for each dinosaur type.
Dinosaur_Dataframe_duplicates_dropped = Dinosaur_Dataframe.drop_duplicates(subset=["Dinosaur_type"])
print(Dinosaur_Dataframe_duplicates_dropped)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\MSBA 601-C Drive\\MACC 601 Storage.csv'

<mark>Type Analysis Here</mark>
Pandas 1
The raw data lists different types of dinosaurs along with their weights and eating habits, and it can be organized in several ways. For my first pandas method I used `sort_values()` to sort the data by each dinosaur's weight, from heaviest to lightest. Although I chose weight, I could instead have grouped the dinosaurs into carnivores and herbivores if that had been my goal. Sorting lets us take data that has no obvious order and arrange it so that it is more useful to the people reading it.
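
The sorting described above, together with the grouping alternative it mentions, can be sketched on a small stand-in table. The rows and the `diet` column name are assumptions made for illustration; the actual CSV may use different names and values.

```python
import pandas as pd

# Hypothetical stand-in for the dinosaur CSV; values are invented.
dinos = pd.DataFrame({
    "Dinosaur_type": ["A", "B", "C", "D"],
    "diet": ["carnivore", "herbivore", "herbivore", "carnivore"],
    "weight_in_tons": [7.0, 12.0, 3.0, 1.5],
})

# Sort from heaviest to lightest, as in the notebook cell.
heaviest_first = dinos.sort_values(by="weight_in_tons", ascending=False)

# Alternative: group by eating habit and average the weight in each group.
mean_weight_by_diet = dinos.groupby("diet")["weight_in_tons"].mean()

print(heaviest_first)
print(mean_weight_by_diet)
```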

Pandas 2
The next method, `drop_duplicates()`, filters out repeated information. Some rows in the raw data repeat a value that we only need once for our analysis. In the dinosaur data I dropped every row whose `Dinosaur_type` had already appeared, so each type is listed only once.
In the real world we often receive more information than we need. For example, employees sometimes make mistakes when handling payroll. If an employee is entered twice by mistake, this method removes the repeated entry without anyone having to search through all the data by hand.
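
The payroll scenario above might look roughly like the sketch below. The table, the `employee_id` key, and all values are hypothetical; `duplicated()` is added here only to show the repeated rows before they are dropped.

```python
import pandas as pd

# Hypothetical payroll table in which employee 102 was entered twice.
payroll = pd.DataFrame({
    "employee_id": [101, 102, 102, 103],
    "name": ["Ana", "Ben", "Ben", "Cy"],
    "salary": [50000, 62000, 62000, 58000],
})

# keep=False flags every row in a repeated group so it can be inspected.
repeats = payroll[payroll.duplicated(subset=["employee_id"], keep=False)]

# keep="first" retains the first entry for each employee and drops the rest.
deduped = payroll.drop_duplicates(subset=["employee_id"], keep="first")

print(repeats)
print(deduped)
```

Inspecting `repeats` before dropping anything is a cautious habit: two rows that share an ID are not always true duplicates.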



## Task 2: Matplotlib Documentation

Explore the Matplotlib documentation and implement two different types of visualizations. You may explore functions for customizing plots, adding annotations, or plotting with subplots.

### Example:
- **Matplotlib Method**: `plt.subplot()` - Create multiple subplots in a single figure.
- **Why I chose this**: Subplots allow me to visualize multiple variables side by side in one figure.
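
A minimal sketch of the side-by-side idea in the example above. It uses `plt.subplots()` (the plural form, which returns the figure and all axes at once) rather than `plt.subplot()`, and the labels and weights are invented for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical values, for illustration only.
labels = ["A", "B", "C"]
weights = [7.0, 12.0, 3.0]

# One row, two columns: a bar chart next to a pie chart of the same data.
fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(10, 4))

ax_bar.bar(labels, weights)
ax_bar.set_ylabel("Weight in tons")
ax_bar.set_title("Weight by type")

ax_pie.pie(weights, labels=labels, autopct="%1.1f%%")
ax_pie.set_title("Share of total weight")

fig.tight_layout()
```

In a notebook the figure is displayed when the cell finishes; in a plain script, `plt.show()` would be needed at the end.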



In [None]:

# Task 2: Matplotlib Visualization 1
# Code goes here: Implement a visualization from the matplotlib documentation.

import pandas as pd
import matplotlib.pyplot as plt

Sections = pd.read_csv("dinosaur_dataframe.csv")
labels = Sections["Dinosaur_type"]
weight = Sections["weight_in_tons"]
fig, ax = plt.subplots()
ax.pie(weight, labels=labels, autopct="%1.1f%%")

In [None]:

# Task 2: Matplotlib Visualization 2
# Code goes here: Implement another visualization from the matplotlib documentation.

import pandas as pd
import matplotlib.pyplot as plt

Sections = pd.read_csv("dinosaur_dataframe.csv")
labels = Sections["Dinosaur_type"]
weight = Sections["weight_in_tons"]
bar_colors = ["tab:red","tab:blue","tab:orange", "tab:green", "tab:purple"]
fig, ax = plt.subplots()
bar_container = ax.bar(labels, weight, color=bar_colors)
ax.set_ylabel("Weight in tons")
ax.set_title("Dinosaurs by Weight")
ax.set_ylim(0, 200)
ax.bar_label(bar_container)

<mark>Type Analysis Here</mark>
Matplotlib 1
A pie chart shows how each part contributes to an overall total in a visually appealing way. It is typically used with percentages, because it shows how each proportional section fits into the whole. Pie charts are easy to use in a business setting and therefore appear often in real-world scenarios. For the dinosaur data, the chart shows each type of dinosaur's share of the total weight.

Matplotlib 2
Bar charts make it easy to compare different items side by side in a visually simple way. This type of chart is very common in the business world, so it is important to understand how to use it in presentations. As with the pie chart, I compared weight across all the types of dinosaurs.



## Task 3: Data Science Library of Your Choice

Choose another data science library (e.g., Seaborn, Scikit-Learn, Numpy) and research one method or technique that helps you analyze or visualize your data.

### Example:
- **Chosen Library**: `Seaborn`
- **Method**: `sns.heatmap()` - Plot heatmaps to visualize the correlation between different variables.
- **Why I chose this**: Heatmaps are useful to quickly identify relationships between features in a dataset.
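
The heatmap example above could be sketched as follows: compute a correlation matrix with pandas, then pass it to `sns.heatmap()`. The column names and numbers are made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical numeric features; values are invented.
df = pd.DataFrame({
    "height_m": [1.0, 2.0, 3.0, 4.0, 5.0],
    "weight_t": [2.0, 4.1, 5.9, 8.2, 9.8],
    "speed_kmh": [50.0, 40.0, 33.0, 20.0, 12.0],
})

# Pairwise Pearson correlations between the numeric columns.
corr = df.corr()

# annot=True writes each coefficient in its cell; vmin/vmax fix the color scale to [-1, 1].
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation between features")
```

Fixing the color scale to the full range from -1 to 1 keeps the colors comparable between different heatmaps.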



In [None]:

# Task 3: Library of Your Choice - Method 1
# Code goes here: Implement a method or functionality from a library of your choice.

import seaborn as sns
sns.set_theme()
titanic = sns.load_dataset("titanic")
sns.relplot(
    data=titanic,
    x="age", y="fare", col="pclass",
    hue="sex", style="sex", size="pclass",
)

<mark>Type Analysis Here</mark>
Data Science Library of Your Choice
For the data science library of my choice, I used seaborn's `relplot()` to draw scatter plots. The plots compare Titanic passengers' fares against their ages, with color and marker style indicating each passenger's sex. I split the data into three panels, one per passenger class, to map the trends among the Titanic passengers. This function shows the trends across the different categories of data visually. In the real world this can help decision makers spot outliers and see emerging market trends.


## Assignment Submission & Grading

- Make sure to put your last name in the file name (this helps me with grading).
- Scoring will be on a 100-point scale (like all other assignments).
- Scoring criteria are based on how clean and complex your code is and the thoughtfulness of your analysis.