## Python '`1-liners`' to start you on your data exploration journey.

<div class = "alert alert-info" role = "alert" 
     style = "font-size: 1.3em; padding: 15px; margin: 0px 0; text-align: center">
    
    In this tutorial, we will kick off exploratory data analysis
    using 7 simple 1-liners.

</div>

### Basic Setup ▶▶ always first import the Python `libraries` you'll need

In [None]:
# Data Handling and Manipulation
import os
import numpy as np
import pandas as pd

# Data Visualisation and Plots
import matplotlib.pyplot as plt

# Suppress warnings (not errors)
import warnings
warnings.filterwarnings('ignore')

### Second, import the data for this practical using the Python `pandas` `read_csv` function

In [None]:
# Step 1: - Give directions to your data (set the path) 
#         - Give your long filename a short name, file

path = "data/"
file = 'P1_TestData_SeaSurface_Variables.csv'


In [None]:
# Step 2: Import your data into pandas dataframe

dataframe = pd.read_csv(os.path.join(path, file),    # Python will follow your directions and look for your filename
                        delimiter = ',',             # CSV: comma delimited values, tells Python to separate by comma
                        comment   = '#')             # Just in case, add hashtag so any descriptive text gets ignored
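
To see what the `comment = '#'` option does, here is a minimal sketch that reads a tiny made-up CSV from memory instead of from disk. The text in `csv_text` and the name `demo` are illustrative assumptions, not part of the practical's data file.

```python
import io
import pandas as pd

# Hypothetical CSV content: one descriptive line starting with '#', then the data
csv_text = """# Example sea surface data (made-up values)
Dates,Temperature,Height
2020-01-01,18.5,0.12
2020-02-01,18.9,0.15
"""

# io.StringIO lets pandas read the text as if it were a file
demo = pd.read_csv(io.StringIO(csv_text), delimiter=',', comment='#')

print(demo.shape)          # (2, 3)
print(list(demo.columns))  # ['Dates', 'Temperature', 'Height']
```

The line starting with `#` is skipped, so the column names are taken from the first real line of the file.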


<div class="alert alert-info" role="alert"
     style="color:#000; font-size:1em; background-color:white; padding:10px; margin:1px; text-align:left;">

#### ⇑⇓ Quick Reminder ⇑⇓

     
    Remember the all-powerful and ever-useful 'print'?
    Let's use it to look inside the dataframe we just made with data imported from our computer.
</div>

In [None]:
# Inspect your dataframe using 'print'

print(dataframe)


<div class="alert alert-info" 
    style="font-size: 1em; padding: 11px; margin: 0px 0; text-align: left">
    
     That looks good! 
     Now we can go through 7 valuable 1-liners to explore your data in depth.
    
      1) info() 
      2) describe()
      3) min() and max()
      4) value_counts()
      5) corr()
      6) scatter() with polyfit 
      7) savefig()
      
</div>

----
### 1. Python's `info`

In [None]:
# Info Dump of everything you need to know about your data!
# - Counts: RangeIndex 0 to 23 (24 total)
# - Non-Nulls: if NaNs are present, how many
# - Dtype: data type for each column

dataframe.info()


<div class="alert alert-info" role="alert"
     style="color:#000; font-size:1em; background-color:white; padding:10px; margin:1px; text-align:left;">

#### ⇑⇓ Quick Reminder ⇑⇓
     
     Data Type is key! 
     Common types that you will come across include:
     
       • Text Type Data (words) : str
       • Numeric Type (numbers) : int, float
       • Boolean (true / false) : bool

     By default, 'dates' are 'object' data type. 
     The pandas to_datetime() function converts dates/ times to 'datetime64[ns]' data type, so it is no longer a generic 
     object that Python may struggle to work with.
     
     We will apply the to_datetime() function later so that we can plot one of the variables in our dataframe over time 
     (optional).
</div>
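
As a quick illustration of the reminder above, the sketch below converts a column of text dates with `pd.to_datetime()` and compares the data type before and after. The small dataframe `demo` is a made-up example; the exact dtype names printed may differ between pandas versions.

```python
import pandas as pd

# Hypothetical mini-dataframe: the dates start out as plain text
demo = pd.DataFrame({'Dates': ['2020-01-01', '2020-02-01', '2020-03-01']})
before = demo['Dates'].dtype

# Convert the text to proper datetime values
demo['Dates'] = pd.to_datetime(demo['Dates'])
after = demo['Dates'].dtype

print(before, '->', after)

# Once converted, date parts such as the month are easy to extract
print(demo['Dates'].dt.month.tolist())   # [1, 2, 3]
```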

### 2. Python's `describe`

In [None]:
# Outputs description of the numeric data in your DataFrame
# - including: min, max, mean, and more

dataframe.describe()


<div class = "alert alert-info" role = "alert" 
     style = "color:#000; font-size: 1.1em; padding: 10px; margin: 0px 0; text-align: left">


**☆ Your Input Needed Here: ☆**
    
In this line of code, you will need to change `'Column Name Here'` to `'Temperature'`, `'Height'`, or `'Salinity'`

</div>

In [None]:
# Select specific column from the dataframe

dataframe[['Column Name Here']].describe()


### 3. Python's `min()` and `max()`

In [None]:
# Alternatively, ask for the features you want like min/max 
# from the specified numeric data column in your dataframe.

# In this case, we selected 'Temperature'
temp_min = dataframe['Temperature'].min()
temp_max = dataframe['Temperature'].max()

# Output min and max temp
print(temp_min, temp_max)

### 4. Python's `value_counts`

In [None]:
# Output how often each value in a specified column occurs, most frequent first (descending counts).
# In this case, we select 'Height'

dataframe['Height'].value_counts()
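
To make the ordering of `value_counts()` concrete, here is a small sketch on made-up numbers (the Series `heights` is an assumption for illustration only). By default the most frequent value comes first; `sort_index()` reorders the result by the values themselves.

```python
import pandas as pd

# Made-up values: 0.2 appears three times, 0.1 twice, 0.3 once
heights = pd.Series([0.1, 0.2, 0.2, 0.3, 0.2, 0.1])

counts = heights.value_counts()
print(counts.tolist())          # [3, 2, 1]
print(counts.index[0])          # 0.2

# Optional: order by the values themselves rather than by how often they occur
by_value = counts.sort_index()
print(by_value.index.tolist())  # [0.1, 0.2, 0.3]
```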


### 5. Python's `corr` for correlation - `r` and `R²`

<div class="alert alert-info" role="alert"
     style="color:#000; font-size:1em; background-color:white; padding:10px; margin:1px; text-align:left;">

#### ⇑⇓ Stats Note ⇑⇓

`In simple linear regression, R² is equal to the square of Pearson's r`

1. **Pearson's r  :**
   Ranges from -1 to +1, with -1 representing a perfect negative correlation, 0 no linear correlation, and +1 a perfect positive correlation. Use when you want to understand the strength and direction of the relationship between two variables.
1. **R²  :**
   Represents the proportion of variance in the dependent variable that is explained by the independent variable. In simple linear regression, R² ranges from 0 to 1 (explaining from 0% to 100% of the variance). Higher values indicate better fit. Use when assessing goodness-of-fit.

</div> 
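
The statement that R² equals the square of Pearson's r can be checked numerically. The sketch below is one way to do it, on made-up numbers: it computes r with `np.corrcoef`, fits a straight line with `np.polyfit`, and then calculates R² from the residuals. All variable names here are illustrative assumptions.

```python
import numpy as np

# Made-up data with a roughly linear relationship
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Pearson's r from the correlation matrix
r_demo = np.corrcoef(x_demo, y_demo)[0, 1]

# R² from a straight-line fit: 1 - (residual variation / total variation)
slope, intercept = np.polyfit(x_demo, y_demo, 1)
y_fit = slope * x_demo + intercept
ss_res = np.sum((y_demo - y_fit) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_fit = 1 - ss_res / ss_tot

# The two numbers should agree up to floating-point rounding
print(r_demo ** 2, r2_fit)
```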

In [None]:
# You can automatically correlate each numeric column in your dataframe
# against every other numeric column. Here, we have three, so we can check,
# e.g., the relationship between Sea Surface Temperature and Sea Surface Height

dataframe[['Temperature', 'Height', 'Salinity']].corr()

# NOTE: you can round values to n decimal places with '.round(n)' -- give it a try

<div class="alert alert-info" 
    style="font-size: 1em; padding: 10px; margin: 0px 0; text-align: left">
    
     The .corr() function outputs a 3 × 3 table of rows and columns with cross-correlation r-values for all the numeric 
     data columns 'Temperature', 'Height', 'Salinity' that we specified from our dataframe.

     If we just want the single Pearson r and/or Goodness of Fit R² outputs, we can use the same function in a slightly 
     different way. Take a look!
     
</div>
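
For reference, here is a minimal sketch of how the correlation table is laid out, using a made-up dataframe `demo` with the same three column names (the values are invented for illustration). Three columns give a table with three rows and three columns, the diagonal is always 1, and a single r-value can be picked out with `.loc`.

```python
import pandas as pd

# Made-up values, only to show the shape of the output
demo = pd.DataFrame({
    'Temperature': [18.0, 19.0, 20.0, 21.0],
    'Height':      [0.10, 0.12, 0.15, 0.16],
    'Salinity':    [35.2, 35.1, 34.9, 34.8],
})

corr_table = demo.corr().round(2)
print(corr_table.shape)   # (3, 3)

# Pick one r-value out of the table by row and column name
single_r = corr_table.loc['Temperature', 'Height']
print(single_r)
```

The table is symmetric, so `corr_table.loc['Height', 'Temperature']` holds the same value.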

In [None]:
# - Output just the Pearson correlation coefficient, r 
#   or the 'goodness of fit', R² without the full table

r  = dataframe['Salinity'].corr(dataframe['Height'])
R2 = r ** 2

r, R2

<div class="alert alert-info" role="alert"
     style="color:#000; font-size:1em; background-color:white; padding:10px; margin:1px; text-align:left;">

#### ⇑⇓ Quick Reminder ⇑⇓
     
    Python's built-in f-strings can be used to print numerical output to n (usually 2) decimal places - this does 
    NOT change the actual values held by our variables (here r and R2), only the format of what is printed.

    Let's use it to display nicer output of r and R²
</div>
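
A tiny sketch of the point above, using a hypothetical number stored in `value`: the f-string produces rounded text for display, while the variable itself keeps all of its digits.

```python
value = 0.8765432        # hypothetical number, for illustration only

text = f"{value:.2f}"    # formatted text with 2 decimal places
print(text)              # 0.88
print(value)             # 0.8765432 (the stored value is unchanged)
```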

In [None]:
# Display output nicely

print(f"r  = {r:.2f}")
print(f"R² = {R2:.2f}")


### 6. We can also visualise correlation with `matplotlib` `scatter`

In [None]:
plt.scatter(dataframe['Salinity'], dataframe['Height'])

In [None]:
# To make plotting easier:

# Short variable names x,y
x = dataframe['Salinity']
y = dataframe['Height']


In [None]:
# To make our figure nicer:

# Step 1: create fig n with axes, set size to 4 × 4 inches (width, height)
fig1, ax = plt.subplots(figsize = (4, 4))

# Step 2: Scatter Salinity (x), Height (y)
plt.scatter(x, y)

# Step 3: Call axes (ax) to format

# e.g.: add dotted grey gridlines
ax.grid(True, color = 'lightgrey', 
        linestyle = ':', linewidth = 0.5)

# e.g.: add x and y-axis labels
ax.set_xlabel('Sea Surface Salinity (psu)')
ax.set_ylabel('Sea Surface Height (m)')

# e.g.: add figure header
ax.set_title("Scatterplot")

# Last step: display figure
plt.show()

<div class="alert alert-info" 
    style="font-size: 1em; padding: 10px; margin: 0px 0; text-align: left">

     Last step:
     Using the numpy polyfit function, it is super straightforward to calculate and fit a linear (1st degree polynomial)
     trendline.
     
     y = mx + c  where:
       • m is the slope
       • c is intercept
         
</div> 
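
Before applying `np.polyfit` to real data, it can help to try it on points that lie exactly on a known line. The sketch below uses made-up points on y = 2x + 1 (the names `x_demo`, `y_demo`, `m_demo`, `c_demo` are assumptions for illustration). Note that `np.polyfit` returns the coefficients from the highest degree down, so the slope comes first and the intercept second.

```python
import numpy as np

# Made-up points that lie exactly on the line y = 2x + 1
x_demo = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_demo = 2.0 * x_demo + 1.0

# Degree 1 = straight line; output order is slope, then intercept
m_demo, c_demo = np.polyfit(x_demo, y_demo, 1)

print(f"m = {m_demo:.2f}")   # m = 2.00
print(f"c = {c_demo:.2f}")   # c = 1.00
```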

In [None]:
# Apply polyfit to x, y, 1=1st degree
# output the slope m, the intercept c

m, c = np.polyfit(x, y, 1)

print(f"m = {m:.3f}")
print(f"c = {c:.2f}")


In [None]:
# Step 1: create fig n with axes, set size 4 × 4 inches
fig2, ax = plt.subplots(figsize = (4, 4))

# Step 2: Scatter Salinity, Height data from dataframe
plt.scatter(x, y)

# Step 3: Call axes (ax) to format
# e.g.: add red trendline
plt.plot(x, m*x + c, color = 'red', linewidth = 1,
         label = f"y = {m:.2f}x + {c:.2f}")

# e.g.: add dotted grey gridlines
ax.grid(True, color = 'lightgrey', 
        linestyle = ':', linewidth = 0.5)

# e.g.: add labels
ax.set_xlabel('Sea Surface Salinity (psu)')
ax.set_ylabel('Sea Surface Height (m)')

# e.g.: add header
ax.set_title("Scatterplot with Trendline")

# Show Legend
ax.legend()

# Last step: display figure
plt.show()

### 7. Save Python figures with `savefig`, inputting `name.png` and `dpi` resolution (dots per inch).

In [None]:
# Save your figure by uncommenting the next line of code
#fig2.savefig('Scatterplot_with_Trendline.png', dpi = 300, bbox_inches = 'tight')
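
If you would like to confirm that a file really was written, one option is sketched below. It builds a small throwaway figure from made-up points, saves it to the system's temporary folder, and checks that the file exists. The names `fig_demo`, `ax_demo` and `out_path`, and the filename `demo_scatter.png`, are illustrative assumptions.

```python
import os
import tempfile
import matplotlib.pyplot as plt

# Throwaway figure with made-up points
fig_demo, ax_demo = plt.subplots(figsize=(4, 4))
ax_demo.scatter([1, 2, 3], [2, 4, 6])

# Save into the temporary folder so nothing in your project is overwritten
out_path = os.path.join(tempfile.gettempdir(), 'demo_scatter.png')
fig_demo.savefig(out_path, dpi=300, bbox_inches='tight')
plt.close(fig_demo)

print(os.path.exists(out_path))   # True if the file was written
```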

<div class = "alert alert-info" role = "alert" 
     style = "color:#000; font-size: 1em; padding: 5px; margin: 0px; text-align: left">

#### **>** Optional
     Lineplot!
     If you're curious about how your variables change over time, try the code block below. 
</div> 

<div class = "alert alert-info" role = "alert" 
     style = "color:#000; font-size: 1.1em; padding: 10px; margin: 0px 0; text-align: left">


**☆ Your Input Needed Here: ☆**
    
1. Replace ['`Column Name Here`'] with your chosen dataframe variable when plotting
   - `x` = dataframe[`'Dates'`], `y` = dataframe[`'Column Name Here'`]
1. Add the corresponding variable name and units to the y-axis label, set with `ax.set_ylabel`.
   
</div>

In [None]:
# First, when you're going to plot your data over time,
# ensure dataframe 'Dates' are in the 'datetime' format
dataframe['Dates'] = pd.to_datetime(dataframe['Dates'])

# -----------------------------------------------------
# Now make your lineplot of dataframe['name'] over time

# Create fig with axes, size 13 x 4 inches (width, height)
fig3, ax = plt.subplots(figsize = (13, 4))

# Lineplot of your selected variable over time,
# with units on the y-axis, Dates on the x-axis
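# (the label is what ax.legend() displays; without one the legend is empty)
plt.plot(dataframe['Dates'], dataframe['Column Name Here'],
         label = 'Selected Variable')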

# Format figure
# Add gridlines
ax.grid(True, color = 'lightgrey', 
        linestyle = ':', linewidth = 0.5)

# Add xy labels
ax.set_xlabel('Time')
ax.set_ylabel('Selected Variable (units)')

# Add a header
ax.set_title("Lineplot of variable over time")

# Show Legend
ax.legend()
# Show figure
plt.show()