# Program analysis.py

***

The program analyses the Iris flower data set. To better understand the data in the data set, see the picture below.

<img src="img/Three-species-of-IRIS-flower.png" width="635" height="313" align="center">

The picture shows the sepal and the petal of each species. It was originally uploaded by Shohal Hossain on the [ResearchGate website](https://www.researchgate.net/figure/Three-species-of-IRIS-flower_fig1_367220930) and may be subject to copyright.

### Project assumptions. 

The program is built around a five-option menu. The user can choose to: generate a summary of the data set as a text file, create histograms from the available data, display scatter plots of pairs of variables, or perform an analysis (in this case, calculating a correlation matrix using the Pearson method). To close the program, the user selects quit.

The program starts by importing the necessary libraries: [pandas](https://pandas.pydata.org), [NumPy](https://numpy.org) and [matplotlib](https://matplotlib.org). 
Follow the links to go to the official library sites.

### Loading data to the program.

Next, the data set is loaded into the program.
This is done by setting the variable pd with pd.read_csv(*filepath_or_buffer*).
More information about pd.read_csv on [the pandas library website](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#).
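
As a minimal sketch of this step, the example below reads a small in-memory sample instead of the real file. The column names other than species, the sample rows and the variable name iris are assumptions made for illustration only.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a file path or any file-like buffer.
iris = pd.read_csv(io.StringIO(csv_text))

print(iris.shape)
print(iris.columns.tolist())
```

With a real file, the buffer would simply be replaced by the path to the CSV file.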


### The main program menu.

The menu code is adapted from Andrew Beatty's [code](https://github.com/andrewbeattycourseware/pands-course-material/blob/main/code/topic06-functions/lab06.09-walkthrough.py) for the Programming and Scripting module.

The whole idea of this menu is to allow users to choose what to do by using dedicated letters.

The displayMenu() function contains, in this case, six print() statements with descriptions and options to choose from. The function then reads the user's input with the input() function and stores it in a variable named choice. Finally, it returns the value of choice.

The while loop keeps the program running until the user selects "q". When "q" is selected, the program prints a dedicated statement and stops.

Inside the while loop is an if-elif-else statement (more on [Programiz website](https://www.programiz.com/python-programming/if-elif-else)), which is used when there is a need to choose between more than two alternatives. The program checks which condition is True and then runs the related block of code. In this case, the else branch catches any other input, which prevents errors.
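
The menu logic described above could be sketched roughly as follows. The menu texts, the runMenu() wrapper, the read parameter and the returned list are assumptions added so the sketch can be run and checked without a keyboard; only displayMenu(), the choice variable and the letters s, h, p, a and q come from the description.

```python
def displayMenu(read=input):
    """Print the options and return the user's choice."""
    print("What would you like to do?")
    print("\t(s) Save a summary to a text file")
    print("\t(h) Save histograms")
    print("\t(p) Show scatter plots")
    print("\t(a) Run the correlation analysis")
    print("\t(q) Quit")
    choice = read("Type one letter (s/h/p/a/q): ").strip()
    return choice


def runMenu(read=input):
    """Loop until the user picks "q"; return the names of the handled options."""
    handled = []
    choice = displayMenu(read)
    while choice != "q":
        if choice == "s":
            handled.append("summary")
        elif choice == "h":
            handled.append("histograms")
        elif choice == "p":
            handled.append("scatter plots")
        elif choice == "a":
            handled.append("analysis")
        else:
            # Any other input is reported instead of raising an error.
            print("Invalid choice, please try again.")
        choice = displayMenu(read)
    print("Goodbye.")
    return handled
```

Calling runMenu() with no argument reads from the keyboard. Passing a function such as `lambda prompt: next(answers)` replays a prepared sequence of answers instead.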



### writeSummaryToFile() function.

The program defines a function called writeSummaryToFile(), which takes the file's name as an argument. 

The *with* statement works with the open() function to open a file and to close it automatically when the block ends. With mode w (write only), the file is opened for writing. If the file does not exist, it is created, but if it does exist, it is overwritten.
<br>In this case, the syntax is:
```
with open(filename, "w") as file:
```

The content of the summary file is written with:
```
file.write("Content")
```

More information about opening files on [Freecodecamp website](https://www.freecodecamp.org/news/with-open-in-python-with-statement-syntax-example/) and on [GeeksforGeeks website](https://www.geeksforgeeks.org/writing-to-file-in-python/).

The following calls are used to write some basic information about the data set to the summary.txt file:
* pd.shape - the shape of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.shape.html#).
* pd.dtypes - the data types of each column. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dtypes.html#).
* pd.head() - the first 5 rows of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html#).
* pd.tail() - the last 5 rows of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.tail.html#). 
* pd.describe() - the statistics summary of the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#).
* pd["species"].value_counts() - the number of each species in the dataset. Read more on [GeeksforGeeks website](https://www.geeksforgeeks.org/python-basics-of-pandas-using-iris-dataset/).
* pd.isnull().sum() - the number of missing values in the dataset. Read more on [Medium website](https://medium.com/@mahim1066/when-you-execute-df-isnull-sum-4a53cf89390c).
* pd.nunique() - the number of unique values in the dataset. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.nunique.html#).
* pd.duplicated().sum() - number of duplicate rows in the dataset. Read more on [Begin Coding Now website](https://begincodingnow.com/duplicate-rows-in-pandas/).
* pd.duplicated(keep=False) - with keep set to False, every row that has a duplicate is marked True, including its first occurrence. Read more on [Pandas official website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html#).

More information about summarising data in Python on [Learn Python Website](https://learnpython.com/blog/how-to-summarize-data-in-python/).
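
A minimal sketch of such a summary function is shown below. It writes only a few of the items listed above. The extra df parameter, the sample rows, the section labels and the temporary output path are assumptions for illustration, and the DataFrame is called df here so that it does not shadow the usual pandas alias.

```python
import os
import tempfile

import pandas as pd


def writeSummaryToFile(df, filename):
    """Write a short text summary of df to filename (overwrites the file)."""
    with open(filename, "w") as file:
        file.write(f"Shape: {df.shape}\n\n")
        file.write(f"Data types:\n{df.dtypes}\n\n")
        file.write(f"Species counts:\n{df['species'].value_counts()}\n\n")
        file.write(f"Missing values:\n{df.isnull().sum()}\n\n")
        file.write(f"Duplicate rows: {df.duplicated().sum()}\n")


# Hypothetical sample: the first and third rows are identical.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 5.1, 6.3],
    "petal_length": [1.4, 1.4, 1.4, 6.0],
    "species": ["setosa", "setosa", "setosa", "virginica"],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "summary.txt")
    writeSummaryToFile(sample, path)
    with open(path) as file:
        text = file.read()

print(text)
```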



The function is called inside the if branch with the condition <font color="lightblue">choice == "s":</font> in the main program.

### saveHist() function.

To save histograms to individual PNG files, a saveHist() function is created.

The function takes the following arguments: 
* data (pandas.Series) is the data for the variable.
* variable_name (str) is the name of the variable.
* num_bins (int, optional) is the number of bins for the histogram. The default value is set to 10.

Inside the function, plt.figure(figsize = ) is used to specify the width and height of the figure in inches. Read more on [Codedamn website](https://codedamn.com/news/python/change-matplotlib-figure-and-plot-size#understanding_pltfigsize).

To set the correct number of ticks on the x-axis, the bin width is calculated as the difference between the maximum and minimum values divided by the number of bins. With this width as the step, the np.arange() function returns evenly spaced values within the given interval. Read more on [NumPy Developers website](https://numpy.org/doc/stable/reference/generated/numpy.arange.html). 
<br>In this case, the stop value must be increased by the step value so that the maximum is included, because np.arange() excludes the stop value itself. Read more on [Statology website](https://www.statology.org/numpy-arange-include-endpoint/).
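
The tick calculation can be illustrated with a short sketch. The sample values and the choice of 8 bins are assumptions picked so that every step is exactly representable as a floating-point number.

```python
import numpy as np

values = np.array([4.0, 5.5, 6.0, 8.0])  # hypothetical measurements
num_bins = 8

# Width of one bin: the data range divided by the number of bins.
bin_width = (values.max() - values.min()) / num_bins

# np.arange excludes the stop value, so one extra step is added to keep the maximum.
ticks = np.arange(values.min(), values.max() + bin_width, bin_width)

print(bin_width)
print(ticks)
```

With step sizes that are not exact in binary floating point, np.arange() can return one element more or fewer than expected. np.linspace(start, stop, num_bins + 1) is an alternative that includes the endpoint by design.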


Ticks are set using the plt.xticks() function. Read more on [Matplotlib website](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.xticks.html#).

The plt.hist() function plots a histogram with arguments such as the data, bins, color, and edgecolor. 
The plt.title() function creates the title. The plt.xlabel() and plt.ylabel() functions create labels for the x-axis and y-axis. 
The plt.savefig() function saves the plot to the desired PNG file, and the plt.close() function closes the plot. 
<br>Read more about how to make the histogram on [GeeksforGeeks website](https://www.geeksforgeeks.org/plotting-histogram-in-python-using-matplotlib/).
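
Put together, a function along these lines might look like the sketch below. The figure size, colours, title text, file name pattern, the out_dir parameter and the returned path are assumptions. The Agg backend is selected only so that the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def saveHist(data, variable_name, num_bins=10, out_dir="."):
    """Save a histogram of data to <out_dir>/<variable_name>.png and return the path."""
    plt.figure(figsize=(8, 5))
    plt.hist(data, bins=num_bins, color="skyblue", edgecolor="black")

    # One tick per bin edge; the extra step keeps the maximum in the tick list.
    bin_width = (data.max() - data.min()) / num_bins
    plt.xticks(np.arange(data.min(), data.max() + bin_width, bin_width))

    plt.title(f"Histogram of {variable_name}")
    plt.xlabel(variable_name)
    plt.ylabel("Frequency")

    path = os.path.join(out_dir, f"{variable_name}.png")
    plt.savefig(path)
    plt.close()
    return path


# Hypothetical sample values written to a temporary folder.
sample = pd.Series([4.3, 5.0, 5.1, 5.8, 6.4, 7.9])
saved_path = saveHist(sample, "sepal_length", out_dir=tempfile.mkdtemp())
print(saved_path)
```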


The function is called inside the elif branch with the condition <font color="lightblue">choice == "h":</font> in the main program. 
In this case, the function is called separately for each variable.

### scatter() function.

To plot a scatter plot for each pair of variables, a scatter() function is created.

The function takes the following arguments:
* pd (pandas.DataFrame): The DataFrame containing the data.
* x (str): Name of the x-axis feature.
* y (str): Name of the y-axis feature.
* FigureName (str): Name of the figure.

Inside the function is a dictionary associating species names and colours. This dictionary is helpful for visualizing data related to these species in scatter plots. 
More information about dictionaries on [W3School website](https://www.w3schools.com/python/python_dictionaries.asp).

The plt.figure(figsize =, num = ) function is used to specify the width and height of the figure in inches, and num is used to give the figure a name. Read more on [Codedamn website](https://codedamn.com/news/python/change-matplotlib-figure-and-plot-size#understanding_pltfigsize).

The `for species in species_mapping:` line starts a loop that iterates through each key (species name) in the species_mapping dictionary. During each iteration, the species variable takes on the current species name. More about iterating through a dictionary on [Real Python website](https://realpython.com/iterate-through-dictionary-python/).

Inside the loop, a subset of the DataFrame pd is created based on the condition:
```
pd['species'] == species
```
This condition filters the rows where the ‘species’ column matches the current species value from the loop.
The resulting subset_pd DataFrame contains only the rows corresponding to the current species being processed in the loop.

The `plt.scatter(subset_pd[x], subset_pd[y], label=species)` line creates a scatter plot. The legend entries come from the label argument passed to each scatter call and are displayed once plt.legend() is called. The legend shows the mapping between species names and their corresponding data points in the scatter plot. The plt.title() function creates the title. The plt.xlabel() and plt.ylabel() functions create labels for the x-axis and y-axis.
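
The steps above can be sketched as follows. The colours, figure size, title text, sample rows and the returned figure are assumptions. The DataFrame is named df and the subset subset_df here so that the pandas alias is not shadowed, and the Agg backend is used only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, no window needed
import matplotlib.pyplot as plt
import pandas as pd


def scatter(df, x, y, FigureName):
    """Draw y against x with one colour per species and return the figure."""
    # Hypothetical colour choice for each species.
    species_mapping = {"setosa": "red", "versicolor": "green", "virginica": "blue"}

    fig = plt.figure(figsize=(8, 5), num=FigureName)
    for species in species_mapping:
        # Keep only the rows that belong to the current species.
        subset_df = df[df["species"] == species]
        plt.scatter(subset_df[x], subset_df[y],
                    color=species_mapping[species], label=species)

    plt.title(f"{x} vs {y}")
    plt.xlabel(x)
    plt.ylabel(y)
    plt.legend()  # builds the legend from the label of each scatter call
    return fig


sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "petal_length": [1.4, 4.7, 6.0],
    "species": ["setosa", "versicolor", "virginica"],
})
fig = scatter(sample, "sepal_length", "petal_length", "Sepal vs petal length")
legend_labels = [t.get_text() for t in fig.axes[0].get_legend().get_texts()]
print(legend_labels)
```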


The function is called inside the elif branch with the condition <font color="lightblue">choice == "p":</font> in the main program. 
In this case, the function is called separately for each pair of variables.

# analysisFunctions.py

***

The analysisFunctions.py file contains functions used to perform the analysis, making the code in the main program easier to read and well-organized. This helps to reduce the number of lines in the main program.

### analyseCorrelation() function.

To analyse correlation based on the correlation table, the analyseCorrelation() function is created.

The function takes one argument:
* outputFileName: the name of the output file.

Inside the function, a correlation matrix is created. This is a table showing the correlation coefficients between all variables. Every cell in the table holds the correlation value for one pair of variables. 

The `df.drop(columns=["species"])` line removes the "species" column from the DataFrame, and the corr() function calculates the correlation matrix. More information about corr() function on [Pandas website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html#). More about how to use drop() on [Pandas website](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html#).
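
As a small illustration, the sketch below builds a correlation matrix for a hypothetical six-row sample. The numeric column names and values are assumptions. corr() uses the Pearson method by default.

```python
import pandas as pd

# Hypothetical sample with two numeric columns and the species column.
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Drop the text column first, then correlate the remaining numeric columns.
correlationMatrix = df.drop(columns=["species"]).corr()

print(correlationMatrix.round(2))
```

The result is symmetric, with 1.0 on the diagonal, because every variable correlates perfectly with itself.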

Inside the function is a dictionary named correlationLevels. It’s used to map descriptions of correlation levels to their corresponding numerical ranges.
More information about dictionaries on [W3School website](https://www.w3schools.com/python/python_dictionaries.asp). More about tuples on [GeeksforGeeks website](https://www.geeksforgeeks.org/python-create-dictionary-of-tuples/).


As in writeSummaryToFile(), the *with* statement works with the open() function to open the output file. With mode w (write only), the file is opened for writing. If the file does not exist, it is created, but if it does exist, it is overwritten.
<br>In this case, the syntax is:
```
with open(filename, "w") as file:
```

The content of the output file is written with:
```
file.write("Content")
```

This block of code is used to iterate over a correlation matrix.

`for col in correlationMatrix.columns:` This line starts a loop that iterates over each column in the correlation matrix. The variable col represents the current column. More about DataFrame columns Property on [W3Schools website](https://www.w3schools.com/python/pandas/ref_df_columns.asp). More about iteration over columns on [SparkByExamples website](https://sparkbyexamples.com/pandas/pandas-iterate-over-columns-of-dataframe-to-run-regression/#:~:text=Toggle%20website%20search-,Pandas%20Iterate%20Over%20Columns%20of%20DataFrame,-Home%20%C2%BB%20Pandas).

`for row in correlationMatrix.index:` Inside the column loop, this line starts another loop that iterates over each row in the correlation matrix. The variable row represents the current row. More about DataFrame index Property on [W3Schools website](https://www.w3schools.com/python/pandas/ref_df_index.asp#:~:text=The%20index%20property%20returns%20the,%2C%20stop%2C%20and%20step%20values.). More about iteration over rows on [GeeksforGeeks website](https://www.geeksforgeeks.org/different-ways-to-iterate-over-rows-in-pandas-dataframe/).

`if row < col:` This line checks if the current row label is less than the current column label. Because the labels are strings, they are compared alphabetically. This is done to skip the diagonal and to avoid duplicate pairs, because the correlation between A and B is no different from the correlation between B and A.

`value = correlationMatrix.loc[row, col]` This line gets the correlation value between the current row and column from the correlation matrix. More about Pandas iloc and loc on [Shane Lynn website](https://www.shanelynn.ie/pandas-iloc-loc-select-rows-and-columns-dataframe/).

`for level, (minVal, maxVal) in correlationLevels.items():` This line starts another loop that iterates over each item in the correlationLevels dictionary. The variable level represents the current key (correlation level), and (minVal, maxVal) represents the current value (a tuple carrying the minimum and maximum values for this level). More about iterating through a dictionary on [Real Python Website](https://realpython.com/iterate-through-dictionary-python/).

`if minVal <= abs(value) <= maxVal:` This line checks if the absolute value of the correlation is within the current level’s range. More about abs() on [W3Schools website](https://www.w3schools.com/python/ref_func_abs.asp).

`correlationType = "positive" if value > 0 else "negative"` This line determines whether the correlation is positive or negative based on the sign of the value.
The next few lines write a detailed description of the correlation between the current pair of features to the file.

`file.write("\n\n")` Finally, this line writes two newline characters to the file, serving as a separator between different pairs of features.
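
The whole loop can be sketched as below. The level names and thresholds, the sample matrix, the wording of the written sentence and the writeCorrelations() helper are assumptions. The sketch writes to any file-like object, so an in-memory buffer stands in for the output file, and the break is an addition that stops a value lying exactly on a boundary from being reported twice.

```python
import io

import pandas as pd

# Hypothetical thresholds; each level maps to a (minimum, maximum) tuple.
correlationLevels = {
    "very weak": (0.0, 0.2),
    "weak": (0.2, 0.4),
    "moderate": (0.4, 0.6),
    "strong": (0.6, 0.8),
    "very strong": (0.8, 1.0),
}


def writeCorrelations(correlationMatrix, file):
    """Write one sentence per unordered pair of features to file."""
    for col in correlationMatrix.columns:
        for row in correlationMatrix.index:
            if row < col:  # string comparison: each pair once, diagonal skipped
                value = correlationMatrix.loc[row, col]
                for level, (minVal, maxVal) in correlationLevels.items():
                    if minVal <= abs(value) <= maxVal:
                        correlationType = "positive" if value > 0 else "negative"
                        file.write(f"{row} and {col}: {level} "
                                   f"{correlationType} correlation ({value:.2f})")
                        file.write("\n\n")
                        break  # assumption: report only the first matching level


names = ["petal_length", "sepal_length", "sepal_width"]
sampleMatrix = pd.DataFrame(
    [[1.00, 0.87, -0.43],
     [0.87, 1.00, -0.12],
     [-0.43, -0.12, 1.00]],
    index=names, columns=names,
)

buffer = io.StringIO()
writeCorrelations(sampleMatrix, buffer)
report = buffer.getvalue()
print(report)
```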

The function is called inside the elif branch with the condition <font color="lightblue">choice == "a":</font> in the main program. To call the function, use the function's name with the name of the file where the correlation analysis will be saved.


***

### End