# Task Commentary

author: Kyra Menai Hamilton

## Introduction

In this document, I give further commentary on each part of the code I have written in analysis.py. Refer to the analysis.py Python file for the complete analysis code as one cohesive piece; here it is broken down and annotated as appropriate.

## Part 1 - Importing the modules and specific tools for the data analysis.

Before any analysis can be started, importing the correct tools is essential. 

In [None]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

After the libraries were imported, the dataset needed to be sourced. It came from the UC Irvine Machine Learning Repository and was imported using the Python import button on the repository page. First, it was necessary to run 'pip install ucimlrepo' in the terminal to install the ucimlrepo package. Following this, the dataset was fetched. It is important to know that the dataset was not initially imported as a DataFrame, but as an object containing the data, metadata and variable information. The dataset was named 'iris' to make it easier to work with.

In [None]:
# Importing the fetch function and fetching the Iris dataset (id=53). The data, metadata and variable information are extracted and checked in the next cell.
from ucimlrepo import fetch_ucirepo 

iris = fetch_ucirepo(id=53) 

Before continuing with the analysis, it was important to save the dataset as a .csv file for future reference, which required converting it first. The x and y frames of the data were extracted; these were the features and targets, respectively. Then the metadata and variable information were printed as a check, after which those print statements were commented out. Finally, the features and targets were combined into a DataFrame, 'iris_df', which was exported to a .csv file using the '.to_csv' method. Upon successful completion, "Iris dataset has been successfully exported to a CSV!" is printed.

In [None]:
# data - extracting x and y (as pandas dataframes) 
x = iris.data.features 
y = iris.data.targets 

# metadata - print was to check
# print(iris.metadata) 

# variable information - print was to check
# print(iris.variables) 

# Combine the features and targets into a single DataFrame (df) so it can be exported as a CSV
iris_df = pd.concat([x, y], axis=1)

# Exporting the DataFrame (df) to a CSV file
iris_df.to_csv('D:/Data_Analytics/Modules/PandS/pands-project/iris.csv', index=False)
print("Iris dataset has been successfully exported to a CSV!") # Output - Iris dataset has been successfully exported to a CSV!
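The combine-and-export step described above can be sketched on a small scale. This is only an illustration under stated assumptions: the two toy frames stand in for `iris.data.features` and `iris.data.targets`, the column subset is chosen for brevity, and the output goes to a temporary folder (a hypothetical location, not the project path used in analysis.py).

```python
from pathlib import Path
import tempfile

import pandas as pd

# Toy stand-ins for iris.data.features and iris.data.targets (illustrative values only).
features = pd.DataFrame({
    "sepal length": [5.1, 7.0, 6.3],
    "petal length": [1.4, 4.7, 6.0],
})
targets = pd.DataFrame({"class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]})

# axis=1 places the frames side by side, matching rows on their shared index.
combined = pd.concat([features, targets], axis=1)

# Hypothetical output location: a temporary folder instead of a fixed drive path.
out_path = Path(tempfile.mkdtemp()) / "iris_demo.csv"
combined.to_csv(out_path, index=False)  # index=False leaves the row labels out of the file

# Reading the file back confirms the round trip preserved shape and column names.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
print(list(reloaded.columns))
```

Building the path with `pathlib` rather than typing a full drive path is one way to keep the script runnable on another machine; whether that suits this project is a judgement call.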

Before continuing with the data analysis, the data was read back in from the saved iris.csv file, so that the analysis works from a single DataFrame that is easy to manipulate.

In [None]:
iris_df = pd.read_csv('D:/Data_Analytics/Modules/PandS/pands-project/iris.csv')

print(iris_df) # This will print the DataFrame to the terminal and also give a brief summary line: [150 rows x 5 columns].

To save printed output directly to a text file rather than showing it in the terminal:

In [None]:
# printing output directly to a txt file: https://labex.io/tutorials/python-how-to-redirect-the-print-function-to-a-file-in-python-398057

# FOR SAVING AS A TXT FILE AND APPENDING AS WE GO ON 
## First, create a file with some initial content
#with open("append_example.txt", "w") as file:
#    print("\nThis content is being added to the file.", file=file)
#    print("Appended on: X DATE", file=file)
## Now, append to the file
#with open("append_example.txt", "a") as file:
#    print("\nThis content is being appended to the file.", file=file)
#    print("Appended on: X DATE", file=file)
#print("Additional content has been appended to append_example.txt")
## Check the final content
#print("\nFinal content of the file:")
#with open("append_example.txt", "r") as file:
#    print(file.read())
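The commented-out template above can be condensed into a small runnable sketch. It uses only the standard library; the file name and the temporary folder are hypothetical choices made so that nothing in the project folder is overwritten.

```python
from pathlib import Path
import tempfile

# Hypothetical file name inside a temporary folder, so no project file is touched.
log_path = Path(tempfile.mkdtemp()) / "append_example.txt"

# Mode "w" creates the file (or empties an existing one) before writing.
with open(log_path, "w") as file:
    print("First line, written in 'w' mode.", file=file)

# Mode "a" keeps the existing content and adds to the end.
with open(log_path, "a") as file:
    print("Second line, added in 'a' mode.", file=file)

# Read the file back to check that both lines are present, in order.
lines = log_path.read_text().splitlines()
print(lines)
```

The practical point is that "w" should be used once, for the first write, and "a" for every later write; reopening the same file with "w" would discard what was already saved.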

Next, basic data checks were conducted and written to a text document.

In [None]:
# Basic data checks - check for missing values, duplicates, and data types
## Using the 'with' statement to handle file operations

with open("basic_data_explore.txt", "w") as file: # The (file=file) argument is important to remember as it makes sure Python knows to write to the file and not the terminal.
    print("Basic data checks:", file=file)
    print("The shape of the dataset:", file=file)
    print(iris_df.shape, file=file)
    print("The first 5 rows of the dataset:", file=file)
    print(iris_df.head(), file=file) # This will print the first 5 rows of the dataset.
    print("The last 5 rows of the dataset:", file=file)
    print(iris_df.tail(), file=file) # This will print the last 5 rows of the dataset.
    print("The column names of the dataset:", file=file)
    print(iris_df.columns, file=file) # This will print the column names of the dataset.
    
print("Basic data checks have been written to basic_data_explore.txt")

with open("basic_data_explore.txt", "a") as file:
    print("Summary of the dataset (number of rows, columns, non-null counts and data types):", file=file)
    iris_df.info(buf=file) # info() prints its summary itself and returns None, so buf=file is needed to send the summary to the file.
    print("The number of missing values in the dataset:", file=file)
    print(iris_df.isnull().sum(), file=file) # This will print the number of missing values in the dataset.
    print("The number of duplicate rows in the dataset:", file=file)
    print(iris_df.duplicated().sum(), file=file) # This will print the number of duplicate rows in the dataset.
    print("The data types of each column in the dataset:", file=file)
    print(iris_df.dtypes, file=file) # This will print the data types of each column in the dataset.
    
print("Basic data checks have been appended to basic_data_explore.txt")
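One detail of these checks is worth a short illustration: `DataFrame.info()` writes its report to an output stream and returns `None`, whereas `isnull().sum()` returns a value that can be printed. The sketch below uses a toy frame and an in-memory buffer (both assumptions for illustration, not part of analysis.py) to show the difference.

```python
import io

import pandas as pd

# Toy frame with one missing value (illustrative data only).
demo = pd.DataFrame({"petal length": [1.4, None, 6.0], "class": ["a", "b", "c"]})

buffer = io.StringIO()            # in-memory stand-in for an open text file
returned = demo.info(buf=buffer)  # the report goes to the buffer, not the terminal
report = buffer.getvalue()

print(returned)  # info() itself returns None
print(report)

# isnull().sum() returns a Series of missing-value counts per column.
missing = demo.isnull().sum()
print(missing)
```

Passing an open file object as `buf` works the same way as the buffer here, which is why the summary can be sent straight into a text file.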

Before conducting additional data analysis, it is important to remove any duplicate rows.

In [None]:
# Need to make sure that any duplicates are removed and that the data types are correct before conducting any analysis.
# Already checked for missing values and we know there are 0, but there are 3 duplicate rows in the dataset.

data = iris_df.drop_duplicates() # This removes rows that are exact copies of an earlier row. Note that subset="class" would instead keep only one row per class (species).
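How the `subset` argument changes the result of `drop_duplicates` can be seen on a toy frame. The values below are illustrative only; the sketch simply compares a full-row comparison with a comparison on the class column alone.

```python
import pandas as pd

# Toy frame: row 2 is an exact copy of row 0 (illustrative values only).
demo = pd.DataFrame({
    "petal length": [1.4, 1.5, 1.4, 4.7],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-setosa", "Iris-versicolor"],
})

n_duplicates = int(demo.duplicated().sum())      # rows identical to an earlier row
full_row = demo.drop_duplicates()                # compares every column
by_class = demo.drop_duplicates(subset="class")  # compares the class column only

print(n_duplicates)   # count of exact duplicate rows
print(len(full_row))  # rows left after removing exact duplicates
print(len(by_class))  # rows left when only the class column is compared
```

With a full-row comparison only the repeated measurement is dropped, while comparing on the class column alone leaves a single row per species, so the choice of `subset` matters a great deal for a dataset with only three classes.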