# Program analysis.py

***

The program analyses the Iris flower data set. The picture below helps to understand the measurements in the data set.

<img src="img/Three-species-of-IRIS-flower.png" width="635" height="313" align="center">

The picture illustrates the sepal and the petal of each species. It was originally uploaded by Shohal Hossain on the [ResearchGate website](https://www.researchgate.net/figure/Three-species-of-IRIS-flower_fig1_367220930) and may be subject to copyright.

### Project assumptions. 

The program is built around a menu with five options. The user can generate a summary of the data set as a text file, create histograms of the available data, display scatter plots of pairs of variables, or perform an analysis (here, calculating a correlation matrix using the Pearson method). To close the program, the user chooses quit.

The program starts by importing the necessary libraries: [pandas](https://pandas.pydata.org), [NumPy](https://numpy.org) and [matplotlib](https://matplotlib.org). 
The links lead to the official library sites.

### Loading data to the program.

Next, the data set is loaded into the program.
This is done by setting the variable pd and using pd.read_csv(*filepath_or_buffer*).
More information about pd.read_csv is available on [the pandas library website](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#).
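The loading step can be sketched as follows. This is a minimal illustration, not the project's code: the two sample rows and the column names are assumptions standing in for the real CSV file, and the DataFrame is called df here to keep it distinct from the usual pandas alias.

```python
import io

import pandas as pd

# Two sample rows stand in for the real CSV file (assumed column names).
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
"""

# read_csv accepts a file path or any file-like object, here a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```

With the real data set, the StringIO buffer would be replaced by the path to the CSV file.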


### The main program menu.

The menu code is adapted from Andrew Beatty's [code](https://github.com/andrewbeattycourseware/pands-course-material/blob/main/code/topic06-functions/lab06.09-walkthrough.py) for the Programming and Scripting module.

The whole idea of this menu is to allow users to choose what to do by using dedicated letters.

The displayMenu() function contains, in this case, six print() statements with descriptions of the options to choose from. It then reads the user's input with the input() function and stores it in a variable named choice. Finally, it returns the value of choice.

The condition of the while loop is that the program will continue to run until the user selects "q". If "q" is selected, the program prints a dedicated statement and finishes working.

Inside the while loop there is an if-elif-else statement (more on [Programiz website](https://www.programiz.com/python-programming/if-elif-else)), which is used when there are more than two alternatives to choose from. The program checks which condition is True and then executes the related block of code. In this case, the else branch handles any input that matches none of the options.
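The menu pattern described above might look roughly like the sketch below. The option texts, the runMenu() name and the read parameter are assumptions made for illustration; read defaults to input() and only exists so the loop can be exercised with scripted answers.

```python
def displayMenu(read=input):
    """Print the options and return the user's choice."""
    print("What would you like to do?")
    print("\t(s) Save a summary to a text file")
    print("\t(h) Save histograms to PNG files")
    print("\t(p) Show scatter plots")
    print("\t(a) Run the analysis")
    print("\t(q) Quit")
    choice = read("Choose an option: ")
    return choice


def runMenu(read=input):
    """Loop until 'q' is chosen; return the actions that were triggered."""
    handled = []
    choice = displayMenu(read)
    while choice != "q":
        if choice == "s":
            handled.append("summary")
        elif choice == "h":
            handled.append("histograms")
        elif choice == "p":
            handled.append("scatter plots")
        elif choice == "a":
            handled.append("analysis")
        else:
            # Any other input is reported instead of raising an error.
            handled.append("invalid")
            print("Please choose s, h, p, a or q.")
        choice = displayMenu(read)
    print("Goodbye.")
    return handled


# Scripted answers replace the keyboard so the sketch runs unattended.
scripted = iter(["s", "z", "q"])
print(runMenu(lambda prompt: next(scripted)))
```

In the real program each branch would call the corresponding function instead of appending a label to a list.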



### writeSummaryToFile() function.

The program has a new function called writeSummaryToFile(), which takes the file's name as an argument. 

The *with* statement works with the open() function to open a file. With access mode "w" (write only), the file is opened for writing. If the file does not exist, it is created; if it does exist, it is overwritten.
<br>In this case, the syntax is:
```
with open(filename, "w") as file:
```

Content is written to the summary file with:
```
file.write("Content")
```

More information about opening files on [Freecodecamp website](https://www.freecodecamp.org/news/with-open-in-python-with-statement-syntax-example/) and on [GeeksforGeeks website](https://www.geeksforgeeks.org/writing-to-file-in-python/).

The summary.txt file reports basic information about the data set, obtained with:
* pd.shape - the shape of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html#).
* pd.dtypes - the data types of each column. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html#).
* pd.head() - the first 5 rows of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#).
* pd.tail() - the last 5 rows of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#). 
* pd.describe() - the statistics summary of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#).
* pd["species"].value_counts() - the number of each species in the dataset. Read more on [GeeksforGeeks website](https://www.geeksforgeeks.org/python-basics-of-pandas-using-iris-dataset/).
* pd.isnull().sum() - the number of missing values in the dataset. Read more on [Medium website](https://medium.com/@mahim1066/when-you-execute-df-isnull-sum-4a53cf89390c).
* pd.nunique() - the number of unique values in the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html#).
* pd.duplicated().sum() - number of duplicate rows in the dataset. Read more on [Begin Coding Now website](https://begincodingnow.com/duplicate-rows-in-pandas/).
* pd.duplicated(keep=False) - with keep set to False, every row that has a duplicate is marked True. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html#).

More about summarizing data in Python on [Learn Python Website](https://learnpython.com/blog/how-to-summarize-data-in-python/).
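A few of the calls listed above can be tried on a tiny sample, as in the sketch below. The four rows are made up for illustration (the first two are deliberate duplicates), and the DataFrame is named df here rather than pd.

```python
import io

import pandas as pd

# Illustrative sample; rows 1 and 2 are identical on purpose.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""
df = pd.read_csv(io.StringIO(csv_text))

summary = {
    "shape": df.shape,
    "species counts": df["species"].value_counts().to_dict(),
    "missing values": int(df.isnull().sum().sum()),
    # duplicated() marks only the repeats; keep=False marks every copy.
    "duplicate rows": int(df.duplicated().sum()),
    "rows involved in duplicates": int(df.duplicated(keep=False).sum()),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

The difference between the last two entries shows why keep=False is useful: it flags the original row as well as its repeat.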



The function is located inside the if branch with the condition <font color="lightblue">choice == "s":</font> in the main program.

### saveHist() function.

To save histograms to individual PNG files, a saveHist() function is created.

The function takes the following arguments: 
* data (pandas.Series): the data for the variable.
* variable_name (str): the name of the variable.
* num_bins (int, optional): the number of bins for the histogram. The default value is 10.

Inside the function, plt.figure(figsize = ) is used to specify the width and height of the figure in inches. Read more on [Codedamn website](https://codedamn.com/news/python/change-matplotlib-figure-and-plot-size#understanding_pltfigsize).

To set the correct number of ticks on the x-axis, the bin width is calculated as the difference between the maximum and minimum values divided by the number of bins. With this width, the np.arange() function produces evenly spaced values within the given interval. Read more on [NumPy Developers website](https://numpy.org/doc/stable/reference/generated/numpy.arange.html). 
<br>In this case, the stop value must be increased by the step value to make sure that the maximum is included. Read more on [Statology website](https://www.statology.org/numpy-arange-include-endpoint/).
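The tick calculation can be illustrated with a small example. The numbers below are made up and chosen so that the bin width is exact in floating point.

```python
import numpy as np

data = np.array([1.0, 2.5, 3.0, 4.5, 6.0])  # illustrative values
num_bins = 10

# (6.0 - 1.0) / 10 = 0.5
bin_width = (data.max() - data.min()) / num_bins

# np.arange excludes the stop value, so one extra step is added to reach the maximum.
ticks = np.arange(data.min(), data.max() + bin_width, bin_width)
print(ticks)
```

With a non-integer step, np.arange can return one value more or fewer than expected because of floating-point rounding. np.linspace(data.min(), data.max(), num_bins + 1) is an alternative that always includes both endpoints.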


Ticks are set using the plt.xticks() function. Read more on [Matplotlib website](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xticks.html#).

The plt.hist() function plots a histogram with arguments such as pd (data), bins, color, and edgecolor. 
The plt.title() function creates the title. The plt.xlabel() and plt.ylabel() functions create labels for the x-axis and y-axis. 
The plt.savefig() function saves the plot to the desired PNG file, and the plt.close() function closes the plot. 
<br>Read more about how to make the histogram on [GeeksforGeeks website](https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/).
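Put together, a function of this kind might look like the sketch below. It is not the project's code: the colours, the title text, the file naming and the folder parameter are assumptions, and the non-interactive Agg backend is selected only so the example runs without a display.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, chosen for this sketch only
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def saveHist(data, variable_name, num_bins=10, folder="."):
    """Save a histogram of data as <variable_name>.png and return its path."""
    plt.figure(figsize=(8, 6))
    bin_width = (data.max() - data.min()) / num_bins
    plt.xticks(np.arange(data.min(), data.max() + bin_width, bin_width))
    plt.hist(data, bins=num_bins, color="skyblue", edgecolor="black")
    plt.title(f"Histogram of {variable_name}")
    plt.xlabel(variable_name)
    plt.ylabel("Frequency")
    path = os.path.join(folder, f"{variable_name}.png")  # hypothetical naming
    plt.savefig(path)
    plt.close()
    return path


# A temporary folder keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as folder:
    sample = pd.Series([4.3, 5.0, 5.1, 5.8, 6.4, 7.9])
    path = saveHist(sample, "sepal_length", folder=folder)
    saved = os.path.exists(path)

print(saved)
```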


The function is located inside the elif branch with the condition <font color="lightblue">choice == "h":</font> in the main program. 
In this case, the function is called separately for each variable.

### scatter() function.

To plot a scatter plot for each pair of variables, a scatter() function is created.

The function takes the following arguments:
* pd (pandas.DataFrame): The DataFrame containing the data.
* x (str): Name of the x-axis feature.
* y (str): Name of the y-axis feature.
* FigureName (str): Name of the figure.

Inside the function is a dictionary associating species names with colours. This dictionary is helpful for visualizing data related to these species in scatter plots. 
More information about dictionaries on [W3Schools website](https://www.w3schools.com/python/python_dictionaries.asp).

plt.figure(figsize =, num = ) is used to specify the width and height of the figure in inches, and num gives the figure a name. Read more on [Codedamn website](https://codedamn.com/news/python/change-matplotlib-figure-and-plot-size#understanding_pltfigsize).

The for species in species_mapping: line starts a loop that iterates through each key (species name) in the species_mapping dictionary. During each iteration, the species variable takes on the current species name. More about iterating through a dictionary on [Real Python website](https://realpython.com/iterate-through-dictionary-python/).

Inside the loop, a subset of the DataFrame pd is created based on the condition:
```
pd['species'] == species
```
This condition filters the rows where the ‘species’ column matches the current species value from the loop.
The resulting subset_pd DataFrame contains only the rows corresponding to the current species being processed in the loop.

The plt.scatter(subset_pd[x], subset_pd[y], label=species) line creates a scatter plot. The legend entries come from the label argument passed to each plt.scatter() call, so the legend shows which data points belong to which species. The plt.title() function creates the title. The plt.xlabel() and plt.ylabel() functions create labels for the x-axis and y-axis.
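The per-species loop can be sketched as follows. The sample rows, the colour choices and the explicit plt.legend() call are assumptions for illustration; the DataFrame is named df here, and the Agg backend is used only so the example runs without opening a window.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, chosen for this sketch only
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Hypothetical colours; the real dictionary may differ.
species_mapping = {"setosa": "red", "versicolor": "green", "virginica": "blue"}

plt.figure(figsize=(8, 6), num="Sepal length vs petal length")
for species in species_mapping:
    # Keep only the rows that belong to the current species.
    subset_df = df[df["species"] == species]
    plt.scatter(subset_df["sepal_length"], subset_df["petal_length"],
                color=species_mapping[species], label=species)

plt.title("Sepal length vs petal length")
plt.xlabel("sepal_length")
plt.ylabel("petal_length")
legend = plt.legend()  # builds one entry per label passed to scatter()
legend_labels = [text.get_text() for text in legend.get_texts()]
plt.close()

print(legend_labels)
```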


The function is located inside the elif branch with the condition <font color="lightblue">choice == "p":</font> in the main program. 
In this case, the function is called separately for each pair of variables.

# analysisFunctions.py

***

The analysisFunctions.py file contains functions used to perform the analysis, making the code in the main program easier to read and well-organized. This helps to reduce the number of lines in the main program.

### analyseCorrelation() function.

To analyse correlation based on the correlation table, the analyseCorrelation() function is created.

The function takes the following argument:
* outputFileName: the name of the output file.

Inside the function, a correlation matrix is created. This is a table showing the correlation coefficients between all variables. Every cell in the table holds the correlation value for a pair of variables. 

The ```df.drop(columns=["species"])``` line removes the "species" column from the DataFrame, and the corr() function calculates the correlation matrix. More information about the corr() function on [Pandas website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html#). More about how to use drop() on [Pandas website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#).
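As a small illustration, the two calls can be chained on a made-up sample. The values below are assumptions, not rows taken from the project's data file.

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "petal_length": [1.4, 1.4, 4.7, 4.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

# Drop the text column first; corr() uses the Pearson method by default.
correlationMatrix = df.drop(columns=["species"]).corr()
print(correlationMatrix)
```

The result is a square, symmetric table with 1.0 on the diagonal, because every variable correlates perfectly with itself.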

Inside the function is a dictionary named correlationLevels. It maps descriptions of correlation levels to their corresponding numerical ranges, which are stored as tuples.
More information about dictionaries on [W3Schools website](https://www.w3schools.com/python/python_dictionaries.asp). More about tuples on [GeeksforGeeks website](https://www.geeksforgeeks.org/python-create-dictionary-of-tuples/).


The *with* statement works with the open() function to open a file. With access mode "w" (write only), the file is opened for writing. If the file does not exist, it is created; if it does exist, it is overwritten.
<br>In this case, the syntax is:
```
with open(filename, "w") as file:
```

Content is written to the analysis file with:
```
file.write("Content")
```

This block of code is used to iterate over a correlation matrix.

`for col in correlationMatrix.columns:` This line starts a loop that iterates over each column in the correlation matrix. The variable col represents the current column. More about DataFrame columns Property on [W3Schools website](https://www.w3schools.com/python/pandas/ref_df_columns.asp). More about iteration over columns on [SparkByExamples website](https://sparkbyexamples.com/pandas/pandas-iterate-over-columns-of-dataframe-to-run-regression/#:~:text=Toggle%20website%20search-,Pandas%20Iterate%20Over%20Columns%20of%20DataFrame,-Home%20%C2%BB%20Pandas).

`for row in correlationMatrix.index:` Inside the column loop, this line starts another loop that iterates over each row in the correlation matrix. The variable row represents the current row. More about DataFrame index Property on [W3Schools website](https://www.w3schools.com/python/pandas/ref_df_index.asp#:~:text=The%20index%20property%20returns%20the,%2C%20stop%2C%20and%20step%20values.). More about iteration over rows on [GeeksforGeeks website](https://www.geeksforgeeks.org/different-ways-to-iterate-over-rows-in-pandas-dataframe/).

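`if row < col:` This line checks whether the row label comes before the column label. Both labels are feature names (strings), so the comparison is alphabetical. This keeps each pair only once, because the correlation between A and B is no different from the correlation between B and A, and it also skips the diagonal, where a feature is paired with itself.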

`value = correlationMatrix.loc[row, col]` This line gets the correlation value between the current row and column from the correlation matrix. More about Pandas iloc and loc on [Shane Lynn website](https://www.shanelynn.ie/pandas-iloc-loc-select-rows-and-columns-dataframe/).

`for level, (minVal, maxVal) in correlationLevels.items()`: This line starts another loop that iterates over each item in the correlationLevels dictionary. The variable level represents the current key (correlation level), and (minVal, maxVal) represents the current value (a tuple carrying the minimum and maximum values for this level). More about iterating through a dictionary on [Real Python Website](https://realpython.com/iterate-through-dictionary-python/).

`if minVal <= abs(value) <= maxVal:` This line checks if the absolute value of the correlation is within the current level’s range. More about abs() on [W3Schools website](https://www.w3schools.com/python/ref_func_abs.asp).

`correlationType = "positive" if value > 0 else "negative"` This line determines whether the correlation is positive or negative based on the sign of the value.
The next few lines write a detailed description of the correlation between the current pair of features to the file.

`file.write("\n\n")` Finally, this line writes two newline characters to the file, serving as a separator between different pairs of features.
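The nested loops walked through above can be sketched on a hand-written matrix. The thresholds in correlationLevels and the matrix values are assumptions for illustration, the results are collected in a list instead of being written to a file, and the break is an addition that stops at the first matching level when a value sits exactly on a shared boundary.

```python
import pandas as pd

# Hypothetical level boundaries; the project's values may differ.
correlationLevels = {
    "weak": (0.0, 0.3),
    "moderate": (0.3, 0.7),
    "strong": (0.7, 1.0),
}

features = ["sepal_length", "sepal_width", "petal_length"]
correlationMatrix = pd.DataFrame(
    [[1.0, -0.1, 0.9],
     [-0.1, 1.0, -0.4],
     [0.9, -0.4, 1.0]],
    index=features,
    columns=features,
)

descriptions = []
for col in correlationMatrix.columns:
    for row in correlationMatrix.index:
        if row < col:  # alphabetical comparison: each pair once, no diagonal
            value = correlationMatrix.loc[row, col]
            for level, (minVal, maxVal) in correlationLevels.items():
                if minVal <= abs(value) <= maxVal:
                    correlationType = "positive" if value > 0 else "negative"
                    descriptions.append((row, col, level, correlationType))
                    break  # stop at the first matching level

for entry in descriptions:
    print(entry)
```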

The function is located inside the elif branch with the condition <font color="lightblue">choice == "a":</font> in the main program. To call the function, pass the name of the file where the correlation analysis will be saved.

```analyseCorrelation("analysis.txt")```

### writeStatsBySpecies() function.

To analyse statistics such as the mean, median and standard deviation by species, the writeStatsBySpecies() function is created.

The function takes the following argument:
* filename: the name of the output file.

Inside the function, a new DataFrame statsBySpecies is created, where the index is the unique species from the 'species' column. The columns are multi-indexed, with the top level being the original column names and the second level being one of 'mean', 'median', or 'std'. Each cell in the DataFrame represents the corresponding statistic for that species and column.

```df.groupby('species')``` groups the DataFrame df by the ‘species’ column.

```.agg(['mean', 'median', 'std'])``` calculates the mean (average), median (middle value), and standard deviation for each species in the DataFrame. 

More about DataFrameGroupBy.agg on [Pandas Documentation](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.core.groupby.DataFrameGroupBy.agg.html).
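The shape of the resulting DataFrame is easier to see on a small sample. The rows below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 7.0, 6.0],
    "petal_length": [1.4, 1.6, 4.7, 4.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})

statsBySpecies = df.groupby("species").agg(["mean", "median", "std"])

# Columns are (original column, statistic) pairs; the index holds the species.
print(statsBySpecies.columns.tolist())
print(statsBySpecies.loc["setosa", ("sepal_length", "mean")])
```

A single cell is addressed with the species as the row label and a (column, statistic) tuple as the column label.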

To understand the structure of the function, review the code:

```with open(filename, 'a') as file:``` This line opens a file with the name stored in the variable filename in ‘append’ mode ('a'). More about with open on [Programiz website](https://www.programiz.com/python-programming/methods/built-in/open).
 
```file.write(f"The title\n\n")``` This line writes a title for the analysis to the file. More about writing to file on [note.nkmk.me website](https://note.nkmk.me/en/python-file-io-open-with/#:~:text=source%3A%20file_io_with_open.py-,Write%20a%20string%3A%20write(),-To%20write%20a).

```for species, row in statsBySpecies.iterrows():``` This line starts a loop that goes through each row of the statsBySpecies DataFrame. For each row, it assigns the index value to species and the data in the row to row. More about Pandas DataFrame iterrows() Method on [W3Schools website](https://www.w3schools.com/python/pandas/ref_df_iterrows.asp).

```file.write(f"{species.capitalize()}:\n")``` This line writes the species name followed by a colon and a newline character to the file.

```for col in ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']:``` This line starts another loop inside the first one that goes through each of the listed column names. More about For Loops in Python on [DataCamp website](https://www.datacamp.com/tutorial/for-loops-in-python#:~:text=Let%27s%20say%20you%20want%20to%20define%20a%20list%20of%20elements%20and%20iterate%20over%20those%20elements%20one%20by%20one.).

```file.write(f"  {col.capitalize().replace('_', ' ')}: mean={row[(col, 'mean')]:.2f}, median={row[(col, 'median')]:.2f}, std={row[(col, 'std')]:.2f}\n")``` This line writes the column name, followed by the mean, median, and standard deviation of the data in that column for the current species. Each value is rounded to two decimal places. Sample of code ```mean={row[(col, 'mean')]:.2f}``` will output a string that starts with “mean=”, followed by the mean value of the specified column col in the DataFrame row, formatted as a float with two decimal places.

```file.write("\n")``` This line writes a newline character to the file, effectively adding a blank line. This separates the data for each species, making the file easier to read.
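The writing loop can be tried without touching the disk, as in the sketch below. An io.StringIO buffer stands in for the opened file, and the sample rows and the shortened column list are assumptions for illustration.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "sepal_length": [5.0, 5.2, 7.0, 6.0],
    "petal_length": [1.4, 1.6, 4.7, 4.5],
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
})
statsBySpecies = df.groupby("species").agg(["mean", "median", "std"])

file = io.StringIO()  # stands in for the file opened in append mode
for species, row in statsBySpecies.iterrows():
    file.write(f"{species.capitalize()}:\n")
    for col in ["sepal_length", "petal_length"]:
        file.write(
            f"  {col.capitalize().replace('_', ' ')}: "
            f"mean={row[(col, 'mean')]:.2f}, "
            f"median={row[(col, 'median')]:.2f}, "
            f"std={row[(col, 'std')]:.2f}\n"
        )
    file.write("\n")  # blank line between species

report = file.getvalue()
print(report)
```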

The function is located inside the elif branch with the condition <font color="lightblue">choice == "a":</font> in the main program. To call the function, pass the name of the file where the mean, median and standard deviation results will be saved.

```writeStatsBySpecies('analysis.txt')```

### findOutliers() function.

This function adds an extra analysis to the existing analysis.txt file.

The function takes the following arguments:
* df: the data frame to search for outliers.
* filename: the name of the output file.

To understand the structure of the function, review the code:


```def findOutliers(df, filename):``` This line defines a function named findOutliers, which takes two arguments: a DataFrame df and a string filename. More about Python Functions on [DataCamp website](https://www.datacamp.com/tutorial/functions-python-tutorial).

```with open(filename, 'a') as f:``` This line opens the file for appending and automatically closes it when the block ends. Within the block, the file is available as f. More about writing to file on [note.nkmk.me website](https://note.nkmk.me/en/python-file-io-open-with/#:~:text=source%3A%20file_io_with_open.py-,Append%20to%20a%20file,-Open%20a%20file).
    

```f.write("Outliers by species for the Iris Data set.\n\n")``` This line writes the title to file f.

```for species in df['species'].unique():``` This line starts a loop over each unique value in the ‘species’ column of the DataFrame df. More about using unique() method on [Favtutor website](https://favtutor.com/blogs/pandas-unique-values-in-column#:~:text=How%20to%20Get-,Unique,-Values%20in%20DataFrame).

```speciesDF = df[df['species'] == species]``` This line creates a new DataFrame speciesDF which includes only the rows where the 'species' column corresponds to the current species. More about filtering DataFrame on [datagy website](https://datagy.io/filter-pandas/#:~:text=to%20a%20Specific-,String,-If%20you%20want).

```numericColumns = speciesDF.select_dtypes(include=['float64', 'int64'])``` This line selects only the columns of speciesDF with numeric data types (float64 and int64), and assigns the resulting DataFrame to numericColumns. More about selecting numeric data from DataFrame on [Pandas website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html#).

```Q1 = numericColumns.quantile(0.25)```, ```Q3 = numericColumns.quantile(0.75)``` These lines calculate the first quartile (Q1) and third quartile (Q3) of the numeric columns, respectively.

```IQR = Q3 - Q1``` This line calculates the interquartile range (IQR), described as the range between the first and third quartiles.

```outliers = speciesDF[(numericColumns < (Q1 - 1.5 * IQR)) | (numericColumns > (Q3 + 1.5 * IQR))]``` This line finds the outliers in speciesDF. Outliers are values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. Cells that are not outliers become NaN in the result.

More about finding outliers using statistical methods on [CareerFoundry website](https://careerfoundry.com/en/blog/data-analytics/how-to-find-outliers/#:~:text=the%20scatter%20plot.-,Finding%20outliers%20using%20statistical%20methods,-Since%20the%20data).

```outliers = outliers.dropna(how='all')```  This line filters out rows from the 'outliers' data where all the columns contain NaN values. More about how to use dropna() on [Pandas website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html#).

```if not outliers.empty:``` This line checks if the outliers DataFrame is empty. If it is not, the code within the if block is executed. More about it on [Pandas website](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.empty.html#).

```f.write(f"Species: {species.capitalize()}\n\n")``` This line writes the name of the current species with the first letter capitalized to the file f.

```for column in outliers.columns:``` This line starts a loop over each column in the outliers DataFrame. More about Pandas DataFrame.columns on [GeeksforGeeks website](https://www.geeksforgeeks.org/python-pandas-dataframe-columns/).

```outlierValues = outliers[column].dropna()``` This line gets the values in the current column of outliers that are not NaN and assigns them to outlierValues.

```if not outlierValues.empty:``` This line checks if outlierValues is empty. If it is not, the code within the if block is executed.

```f.write(f"   Feature: {column.capitalize().replace('_', ' ')}\n")``` This line writes the name of the current feature (i.e., column name) to the file f, with the first letter capitalized and underscores replaced with spaces.

```f.write("   Outlier Values: " + ', '.join([str(value) for value in outlierValues.values]) + "\n\n")``` This line writes the outlier values for the current feature to the file f. It converts all the values in outlierValues.values to strings and joins them into a single string, with each value separated by a comma and a space. The + "\n\n" adds two newline characters at the end of the string. In Python, the newline character (\n) starts a new line, so "\n\n" creates a blank line.
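The 1.5 * IQR rule described above can be sketched as a function that returns its findings instead of writing them to a file. This is a variant, not the project's code: the name findOutliersSketch and the sample values are assumptions, and the mask is applied to the numeric columns only, so the species column never appears in the result.

```python
import pandas as pd


def findOutliersSketch(df):
    """Return {species: {column: [outlier values]}} using the 1.5 * IQR rule."""
    result = {}
    for species in df["species"].unique():
        speciesDF = df[df["species"] == species]
        numericColumns = speciesDF.select_dtypes(include=["float64", "int64"])
        Q1 = numericColumns.quantile(0.25)
        Q3 = numericColumns.quantile(0.75)
        IQR = Q3 - Q1
        mask = (numericColumns < (Q1 - 1.5 * IQR)) | (numericColumns > (Q3 + 1.5 * IQR))
        # Non-outlier cells become NaN; rows without any outlier are dropped.
        outliers = numericColumns[mask].dropna(how="all")
        for column in outliers.columns:
            outlierValues = outliers[column].dropna()
            if not outlierValues.empty:
                result.setdefault(species, {})[column] = outlierValues.tolist()
    return result


# Illustrative data: only the 5.0 in setosa's sepal_width lies outside the fences.
sample = pd.DataFrame({
    "sepal_length": [5.0, 5.1, 5.0, 5.1, 5.0, 5.1, 6.0, 6.1, 6.2, 6.3],
    "sepal_width": [3.0, 3.0, 3.1, 3.1, 3.2, 5.0, 2.8, 2.9, 3.0, 3.1],
    "species": ["setosa"] * 6 + ["versicolor"] * 4,
})
found = findOutliersSketch(sample)
print(found)
```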

The function is located inside the elif branch with the condition <font color="lightblue">choice == "a":</font> in the main program. To call the function, pass the data frame and the name of the file where the outliers will be saved. df is declared in analysisFunctions.py.


```findOutliers(df, "analysis.txt")```

Please note that three functions have been added to analysisFunctions.py. The first function creates a text file named analysis.txt. The other functions just add analysis to an existing file.
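The effect of this ordering can be shown with a minimal sketch. The helper names and section titles are hypothetical stand-ins for the three analysis functions: the first opens the file in "w" mode, the others in "a" mode.

```python
import os
import tempfile


def writeFirstSection(filename):
    # "w" creates the file, or empties it if it already exists.
    with open(filename, "w") as file:
        file.write("Correlation analysis\n")


def appendSection(filename, title):
    # "a" keeps the existing content and adds to the end.
    with open(filename, "a") as file:
        file.write(title + "\n")


with tempfile.TemporaryDirectory() as folder:
    filename = os.path.join(folder, "analysis.txt")
    for _ in range(2):  # run the whole analysis twice
        writeFirstSection(filename)
        appendSection(filename, "Statistics by species")
        appendSection(filename, "Outliers by species")
    with open(filename) as file:
        lines = file.read().splitlines()

print(lines)
```

Because the first function overwrites the file, running the analysis again does not pile up repeated sections. The order of the calls matters: calling an appending function first would add to whatever an earlier run left behind.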

# Redesigning program structure.

***

Initially, the program was designed with the following functionality: the user launches the main program, and a menu appears with the following options:
* s: Generates a text file summarizing the dataset about irises.
* h: Creates histograms for individual values and saves them as PNG files.
* p: Generates and displays scatter plots for various pairs of data.
* a: Conducts the following analyses and saves it to the analysis.txt file:
     - Correlation analysis.
     - Mean, median, and standard deviation calculations for each species.
     - Identification of outliers for each species.
* q: Quits the program.

Important: scatter plots are not displayed simultaneously. To display the next scatter plot, the currently displayed one must be closed.

If an option other than ‘q’ is selected, the associated actions are executed, and then the menu is displayed again. Only choosing ‘q’ exits the program. 

The program structure consists of the main file, analysis.py, and a separate file, analysisFunctions.py, which contains the functions used for analysis. The writeSummaryToFile() function was moved to this file, which makes the code of the main program clearer. To further improve readability, a new file called Plottingfunctions.py was created, and the plotting functions initially located in the main program were moved there. This modular approach improves the organization of the code and makes it more readable.

During program development, several ideas emerged, such as:
* Displaying information about where the user can find the scatter plot and how to interact with the plots. This helps the user understand what is happening and how to use the program.
* Displaying a comment after an option is selected and waiting for the user to press any key. Without the pause, the user might overlook the comment.
* Reducing the code used to call functions. The easiest way to process repeatable data, as in the plotting functions, is to create basic control functions. This also makes the code simpler.
* Calling all the functions needed for the analysis through one simple function.
* Avoiding errors caused by accidentally typing capital letters. This is done with the .lower() method.
* Improving interaction with the user through clearer messages. Straightforward messages and intuitive guidance through the options make the program more user-friendly.

### New functions mentioned above.
***

### pause() function.

The role of this function is to wait until the user presses any key. This gives the user time to read the message describing what happened after an option was chosen.

The function does not take any arguments.

The function contains:

```print("Press any key to come back to the menu.\n")``` This line prints the message "Press any key to come back to the menu." to the console. The \n at the end of the string is a newline character, which causes anything printed afterwards to appear on a new line.

```msvcrt.getch()``` This line calls the getch() function from the msvcrt module. getch() waits for the user to press a key and then returns the character of the pressed key, without displaying it in the console.

Please note that the msvcrt module is available only on Windows.
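If the program needs to run on other systems as well, one possible approach is to fall back to input() when msvcrt cannot be imported. The sketch below is an assumption about how that could be done, not part of the project; on non-Windows systems it waits for Enter rather than for any key.

```python
try:
    import msvcrt  # part of the standard library on Windows only
except ImportError:
    msvcrt = None


def pause():
    """Wait for the user before returning to the menu."""
    if msvcrt is not None:
        print("Press any key to come back to the menu.\n")
        msvcrt.getch()
    else:
        # Fallback for systems without msvcrt: wait for the Enter key.
        input("Press Enter to come back to the menu.\n")
```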

The function is called in the main program inside the if-elif-else statement, in the blocks of code that execute after an option is selected.

It is called with:

```pause()```

### analysisOneCall().

The function's role is to run the previous analysis functions from the main program with a single call.

The function does not take any arguments.

The function includes three calls of existing analysis functions:
* ```analyseCorrelation("analysis.txt")```
* ```writeStatsBySpecies("analysis.txt")```                                       
* ```findOutliers(df, "analysis.txt")```


***

### End