# Core Sample Reader
## Sample project for Julia
## Python 3.x, Jupyter Notebook

### Set our user-controllable variables

**Important:** Your xlsx dataset will need to be properly formatted.  We use the "Mockup" sheet in data.xlsx for this example.  
`source` should point to an xlsx with a single cell sample dataset.  _Quick note: the `r` before a 'string' makes it a raw string literal, so backslashes are kept as-is instead of being read as escape sequences._  
`subject` should be the sheet name / sample subject in the source file  
`threshold` defines the minimum Max Δ we use to identify the row we wish to mark  
`cell_color` defines the highlight color in the resulting document  
`showoutput` and `showcharts` toggle the intermediate tables and the charts

In [13]:
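# Only fall back to a default when the variable has not been defined already
try:
    source
except NameError:
    source = r'L1 NM raw data.xlsx'
try:
    threshold
except NameError:
    threshold = 100
try:
    cell_color
except NameError:
    cell_color = r'#009933'
try:
    showoutput
except NameError:
    showoutput = True
try:
    subject
except NameError:
    subject = r'F74'
try:
    showcharts
except NameError:
    showcharts = True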
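
As a small illustration of the two ideas in this cell (a sketch, not part of the original notebook): the `r` prefix only changes how backslashes are read, and the `try` / `except NameError` pattern only assigns a default when the name hasn't been defined yet. The path and the name `demo_threshold` are made up for the example.

```python
# Raw strings keep backslashes as-is; regular strings treat them as escapes.
raw_path = r'C:\new\table.xlsx'
plain_path = 'C:\new\table.xlsx'   # here \n and \t turn into a newline and a tab

print(len(raw_path), len(plain_path))

# Only set a default when the variable is not defined yet.
try:
    demo_threshold
except NameError:
    demo_threshold = 100

print(demo_threshold)
```

If `demo_threshold` had been set before this snippet ran, the existing value would be kept.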

### Install Notebook Dependencies
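This tells python (specifically, the pip utility) to install a few dependencies that we're going to need.  In Jupyter, `!` means "this is a shell command, not a python statement", and `{ }` interpolates the value of `sys.executable` (the path of the running python interpreter) into the command.  When `showoutput` is off, we send pip's output to `/dev/null` to keep the notebook quiet.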
  
-  numpy - swiss army tool for working with scientific data & numbers  
-  pandas - tool to work with ordered, tabulated data.  Also allows us to directly read & write to xlsx formats  
-  tabulate - allows us to pretty-print out our data in this example  
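-  openpyxl, xlsxwriter, xlrd - libraries pandas uses to read & write excel (xlsx) documents  
-  Jinja2 - required by the pandas Styler we use to highlight cells  
-  plotly - used for the charts in the bonus section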

In [14]:
import sys
if showoutput:
    !{sys.executable} -m pip install numpy pandas tabulate openpyxl xlsxwriter xlrd nbformat Jinja2 plotly
else:
    !{sys.executable} -m pip install numpy pandas tabulate openpyxl xlsxwriter xlrd nbformat Jinja2 plotly > /dev/null
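
The `!` syntax only works inside Jupyter / IPython. As a rough sketch of the same idea in plain python, we can build the command as a list and hand it to `subprocess`. The helper name `pip_install_command` and its `quiet` parameter are made up for this example; `--quiet` is pip's own flag and stands in for the `/dev/null` redirect.

```python
import subprocess
import sys

def pip_install_command(packages, quiet=False):
    """Build the argument list for 'python -m pip install ...'."""
    cmd = [sys.executable, '-m', 'pip', 'install']
    if quiet:
        cmd.append('--quiet')   # ask pip itself to print less
    return cmd + list(packages)

cmd = pip_install_command(['numpy', 'pandas'], quiet=True)
print(cmd)

# To actually run the install (not executed in this sketch):
# subprocess.run(cmd, check=True)
```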



Next, we `import` the libraries we want to use.  This makes the names in their namespaces available in our notebook.  In this case, we use the optional `as` keyword to alias the pandas & numpy libraries, and `from` to import the `tabulate` & `display` functions from their respective packages.

In [15]:
import pandas as pd
import numpy as np
from IPython.display import display
from tabulate import tabulate

### Onto the code!
We use the Pandas library, which we've aliased to the `pd` namespace, to load the file contained in the variable `source`

Pandas is a handy utility for manipulating excel-style data.  We set the `Region` & `Channel` index here for the initial grouping

In [16]:
data = pd.read_excel(source, sheet_name=subject).set_index(['Region','Channel'])
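
To show what `set_index` with two columns does without needing the xlsx file, here is a minimal sketch on a few made-up rows. The values are invented; only the column names are borrowed from this notebook.

```python
import pandas as pd

# Made-up rows using a few of the column names from this notebook.
demo = pd.DataFrame({
    'Region': ['A', 'A', 'B'],
    'Channel': ['C1', 'C2', 'C1'],
    'ROI': [1, 1, 1],
    'Max': [120, 0, 45],
})

# Two index columns give us a MultiIndex; rows are addressed by a (Region, Channel) tuple.
indexed = demo.set_index(['Region', 'Channel'])
print(list(indexed.index.names))
print(indexed.loc[('A', 'C2'), 'Max'])
```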

#### Let's take a quick look at our data...
We've reformatted our data to be a little simpler to use with Pandas.  We could load the existing data as-is with some slight modifications, but duplicate column names would then show up as `key key.1 key.2 ...`, etc.  
That would require us to melt the table, pivot, & group.  I think a better approach, if possible, is to restructure the data slightly as seen in the "Mockup" sheet.  I can help you write a data converter for all of your existing samples if you'd like, but the code will be a little difficult to follow.  

So, for now, you can see we've added Sample & Core columns and dropped everything into a uniform table we can easily use in Pandas.

In [None]:
if showoutput:
    display(data)
    print(len(data))
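
For the curious, this is roughly what the "melt" route mentioned above could look like. It's only a sketch: the wide layout, the values, and the channel labels `C1`..`C3` are assumptions for the example, not your actual raw format.

```python
import pandas as pd

# Hypothetical wide layout: one row per ROI, one 'Max' column per channel.
# pandas renames repeated headers to 'Max', 'Max.1', 'Max.2' when loading.
wide = pd.DataFrame({
    'ROI': [1, 2],
    'Max': [120, 30],
    'Max.1': [0, 250],
    'Max.2': [75, 10],
})

# Melt to one row per (ROI, channel) pair.
tidy = wide.melt(id_vars='ROI', var_name='Channel', value_name='Value')

# Map the mangled column names to channel labels (labels are made up).
tidy['Channel'] = tidy['Channel'].map({'Max': 'C1', 'Max.1': 'C2', 'Max.2': 'C3'})
tidy = (tidy.rename(columns={'Value': 'Max'})
            .sort_values(['ROI', 'Channel'])
            .reset_index(drop=True))
print(tidy)
```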

Yay! We have our data.  Now, we don't actually *need* to sort our data to identify the first core whose Max Δ reaches the threshold, but we'll do it anyway to simplify the code even more.  There are multiple approaches here using aggregates, but the simplest is to just sort the lot using our indexes & regroup it.

In [None]:
sorted_data = data.sort_values(by=['Region','Channel','Max'])
if showoutput:
    display(sorted_data)

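Great, the table is now sorted by Max within each region & channel.  Next we group the regions & channels for our Max Δ calculation: `diff()` gives us the difference between each row and the previous row of the same group, so the first row of every group comes out as NaN.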

In [None]:
sorted_data['Max_Delta'] = sorted_data.groupby(['Region','Channel'])['Max'].diff()
if showoutput:
    display(sorted_data)
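
Here is a tiny worked example of the sort + grouped `diff()` step on made-up numbers, so you can see where the NaN values come from.

```python
import pandas as pd

demo = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'B', 'B'],
    'Channel': ['C1', 'C1', 'C1', 'C1', 'C1'],
    'Max': [10, 150, 30, 5, 400],
})

# Sort inside each group, then take row-to-row differences per group.
demo = demo.sort_values(['Region', 'Channel', 'Max'])
demo['Max_Delta'] = demo.groupby(['Region', 'Channel'])['Max'].diff()
print(demo)
```

The first row of each (Region, Channel) group has no previous row, so its `Max_Delta` is NaN and it can never count as a threshold hit.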

Done! Next we need to find our target rows, highlight them, and write everything out to an xlsx.

We start by generating a threshold table from our calculated `Max_Delta` column.  We keep the rows that reach the threshold, group on our indexes, and invoke first() on each group.  
There are many approaches to selecting a first value, but Python & Pandas don't give us an elegant solution out of the box, and this is the easiest approach.  
Additionally, we add the column `Max_Delta_First_Hit` to the `first_threshold_table` dataframe that we'll need later on.

In [None]:
# '@threshold' pulls in the user-controllable variable instead of a hardcoded 100
first_threshold_table = sorted_data.query('Max_Delta >= @threshold').groupby(['Region','Channel']).first()
first_threshold_table['Max_Delta_First_Hit'] = True
if showoutput:
    display(first_threshold_table)
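
A minimal sketch of the "filter, group, take first" step on made-up rows. `demo_threshold` stands in for the notebook's `threshold`; inside `query()`, the `@` prefix refers to a python variable rather than a column.

```python
import pandas as pd

demo_threshold = 100  # stand-in for the notebook's `threshold`

# Rows are already sorted by Max within each group, as in the notebook.
demo = pd.DataFrame({
    'Region': ['A', 'A', 'A', 'B', 'B'],
    'Channel': ['C1'] * 5,
    'ROI': [3, 1, 2, 7, 8],
    'Max': [10, 30, 150, 5, 400],
    'Max_Delta': [None, 20.0, 120.0, None, 395.0],
})

# Keep rows that reach the threshold, then take the first one per group.
hits = demo.query('Max_Delta >= @demo_threshold')
first_hits = hits.groupby(['Region', 'Channel']).first()
print(first_hits)
```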

Next we need to merge the `first_threshold_table` dataframe with our `sorted_data` dataframe.  Calling .merge() will drop the indexes, so we invoke reset_index() first to turn them back into regular columns.  
Our merge strategy in this scenario will be `left`: every row of `sorted_data` is kept, and only the rows that match a first hit get `Max_Delta_First_Hit` filled in.  I'd recommend reading up on SQL & structured data `JOIN` operations to understand the methodology.

In [None]:
merged_data = sorted_data.reset_index().merge(first_threshold_table, how='left')
if showoutput:
    display(merged_data)
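
To make the `left` strategy concrete, here is a small sketch with made-up rows. Unlike the cell above, it names the join columns explicitly with `on=`, which is a choice made for this example only.

```python
import pandas as pd

rows = pd.DataFrame({
    'Region': ['A', 'A', 'B'],
    'ROI': [1, 2, 1],
    'Max': [30, 150, 400],
})
hits = pd.DataFrame({
    'Region': ['A'],
    'ROI': [2],
    'Max_Delta_First_Hit': [True],
})

# Every row on the left survives; unmatched rows get NaN in the new column.
merged = rows.merge(hits, how='left', on=['Region', 'ROI'])
print(merged)
```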

Let's define a styler function to color our cells...  
This bit is a little more complicated and convoluted due to how Pandas works.

In [None]:
def style_when_true(series):
    match_table = [1 if x == True else 0 for x in series]
    return [f'background-color: {cell_color}' if v else '' for v in match_table]

Next, we want to apply our formatter to highlight the cell using `Max_Delta_First_Hit`

In [None]:
formatted_data = merged_data.style.apply(style_when_true, subset='Max_Delta_First_Hit')
if showoutput:
    display(merged_data.head(25).style.apply(style_when_true, subset='Max_Delta_First_Hit'))
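
If the styler feels opaque: `Styler.apply` hands our function one column at a time as a Series, and expects back one CSS string per cell. The sketch below calls a copy of the function directly on a made-up column so you can see what it returns. `demo_color` is a stand-in for `cell_color`.

```python
import pandas as pd

demo_color = '#009933'  # same value as the notebook's default `cell_color`

def style_when_true(series):
    # One CSS string per cell; an empty string means "no styling".
    return [f'background-color: {demo_color}' if x == True else '' for x in series]

# After the left merge, the flag column holds True for hits and NaN elsewhere.
flags = pd.Series([True, float('nan'), True], dtype=object)
styles = style_when_true(flags)
print(styles)
```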

## Time for part 2 - let's massage our data a little bit.

In [None]:
#Define a new channel-based dataframe
df_channel = pd.DataFrame(columns=[
    'Region',
    'Channel',
    'Positive',
    'Negative',
    'Count',
    'Positive_Ratio',
])

#Define a new region-based dataframe
df_region = pd.DataFrame(columns=[
    'Region',
    'Count',
    'C1_Positive',
    'C1_Positive_Ratio',
    'C2_Positive',
    'C2_Positive_Ratio',
    'C3_Positive',
    'C3_Positive_Ratio',
    'C1C2_Positive',
    'C1C2_Positive_Ratio',
    'C1C3_Positive',
    'C1C3_Positive_Ratio',
    'C2C3_Positive',
    'C2C3_Positive_Ratio',
    'C1C2C3_Positive',
    'C1C2C3_Positive_Ratio',
    'C1-Only_Positive',
    'C1-Only_Positive_Ratio',
    'C2-Only_Positive',
    'C2-Only_Positive_Ratio',
    'C3-Only_Positive',
    'C3-Only_Positive_Ratio',
    'C1C2-Only_Positive',
    'C1C2-Only_Positive_Ratio',
    'C1C3-Only_Positive',
    'C1C3-Only_Positive_Ratio',
    'C2C3-Only_Positive',
    'C2C3-Only_Positive_Ratio',
    'Negative',
    'Negative_Ratio',
])

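#Calculate our channel data, where a positive result is derived from Max > 0
#(rows are collected in a list because DataFrame.append was removed in pandas 2.0)
channel_rows = []
for (region, channel), group in merged_data.groupby(['Region','Channel']):
    positive = int((group['Max'] > 0).sum())
    channel_rows.append({
        'Region':region,
        'Channel':channel,
        'Positive':positive,
        'Negative':int((group['Max'] <= 0).sum()),
        'Count':len(group),
        'Positive_Ratio':positive / len(group)
        })
df_channel = pd.DataFrame(channel_rows, columns=df_channel.columns)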

#Set indexer on channel dataframe
df_channel = df_channel.set_index('Region')

#Fill region dataframe with regions & zero out all values
df_region = pd.DataFrame(
    0,
    index=pd.Index(merged_data['Region'].unique(), name='Region'),
    columns=df_region.columns.drop('Region'),
)

#Calculate our region data, where a positive result is derived from Max > 0 with 'AND' channel groupings
for i in merged_data.groupby(['Region','ROI']):
    try:
        df_region.at[i[1].iloc[0]['Region'], 'Count'] += 1
        df_region.at[i[1].iloc[0]['Region'], 'C1_Positive'] += i[1].iloc[0]['Max'] > 0
        df_region.at[i[1].iloc[1]['Region'], 'C2_Positive'] += i[1].iloc[1]['Max'] > 0
        df_region.at[i[1].iloc[2]['Region'], 'C3_Positive'] += i[1].iloc[2]['Max'] > 0
        df_region.at[i[1].iloc[2]['Region'], 'C1C2_Positive'] += (i[1].iloc[0]['Max'] > 0) and (i[1].iloc[1]['Max'] > 0)
        df_region.at[i[1].iloc[2]['Region'], 'C1C3_Positive'] += (i[1].iloc[0]['Max'] > 0) and (i[1].iloc[2]['Max'] > 0)
        df_region.at[i[1].iloc[2]['Region'], 'C2C3_Positive'] += (i[1].iloc[1]['Max'] > 0) and (i[1].iloc[2]['Max'] > 0)
        df_region.at[i[1].iloc[0]['Region'], 'C1-Only_Positive'] += i[1].iloc[0]['Max'] > 0 and (i[1].iloc[1]['Max'] <= 0) and (i[1].iloc[2]['Max'] <= 0)
        df_region.at[i[1].iloc[1]['Region'], 'C2-Only_Positive'] += i[1].iloc[1]['Max'] > 0 and (i[1].iloc[0]['Max'] <= 0) and (i[1].iloc[2]['Max'] <= 0)
        df_region.at[i[1].iloc[2]['Region'], 'C3-Only_Positive'] += i[1].iloc[2]['Max'] > 0 and (i[1].iloc[0]['Max'] <= 0) and (i[1].iloc[1]['Max'] <= 0)
        df_region.at[i[1].iloc[2]['Region'], 'C1C2-Only_Positive'] += (i[1].iloc[0]['Max'] > 0) and (i[1].iloc[1]['Max'] > 0) and (i[1].iloc[2]['Max'] <= 0)
        df_region.at[i[1].iloc[2]['Region'], 'C1C3-Only_Positive'] += (i[1].iloc[0]['Max'] > 0) and (i[1].iloc[2]['Max'] > 0) and (i[1].iloc[1]['Max'] <= 0)
        df_region.at[i[1].iloc[2]['Region'], 'C2C3-Only_Positive'] += (i[1].iloc[1]['Max'] > 0) and (i[1].iloc[2]['Max'] > 0) and (i[1].iloc[0]['Max'] <= 0)
        df_region.at[i[1].iloc[2]['Region'], 'C1C2C3_Positive'] += (i[1].iloc[0]['Max'] > 0) and (i[1].iloc[1]['Max'] > 0) and (i[1].iloc[2]['Max'] > 0)
        df_region.at[i[1].iloc[2]['Region'], 'Negative'] += (i[1].iloc[0]['Max'] <= 0) and (i[1].iloc[1]['Max'] <= 0) and (i[1].iloc[2]['Max'] <= 0)
    except Exception as e:
        print("Data input issue, we ran into a problem here:")
        print(i)
        raise

#Calculate ratios
df_region['C1_Positive_Ratio'] = (df_region['C1_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C2_Positive_Ratio'] = (df_region['C2_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C3_Positive_Ratio'] = (df_region['C3_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C1C2_Positive_Ratio'] = (df_region['C1C2_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C1C3_Positive_Ratio'] = (df_region['C1C3_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C2C3_Positive_Ratio'] = (df_region['C2C3_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C1-Only_Positive_Ratio'] = (df_region['C1-Only_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C2-Only_Positive_Ratio'] = (df_region['C2-Only_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C3-Only_Positive_Ratio'] = (df_region['C3-Only_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C1C2-Only_Positive_Ratio'] = (df_region['C1C2-Only_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C1C3-Only_Positive_Ratio'] = (df_region['C1C3-Only_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C2C3-Only_Positive_Ratio'] = (df_region['C2C3-Only_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['C1C2C3_Positive_Ratio'] = (df_region['C1C2C3_Positive'] / df_region['Count']).astype(np.double).round(5)
df_region['Negative_Ratio'] = (df_region['Negative'] / df_region['Count']).astype(np.double).round(5)
if showoutput:
    display(df_region)
    display(df_channel.groupby(['Region','Channel']).first())
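
The loops above are explicit on purpose, but the same counts can also be expressed with grouped aggregates. The following is only a sketch on made-up data (one region, two ROIs, three channels), showing the per-channel summary and a few of the region combinations; it is not a drop-in replacement for the cell above.

```python
import pandas as pd

# Made-up data: one region, two ROIs, three channels per ROI.
demo = pd.DataFrame({
    'Region':  ['A'] * 6,
    'ROI':     [1, 1, 1, 2, 2, 2],
    'Channel': ['C1', 'C2', 'C3'] * 2,
    'Max':     [120, 40, 0, 0, 0, 0],
})

# Per-channel counts, the vectorized counterpart of the df_channel loop.
positive = demo['Max'] > 0
channel_summary = (
    demo.assign(Positive=positive, Negative=~positive)
        .groupby(['Region', 'Channel'])
        .agg(Positive=('Positive', 'sum'),
             Negative=('Negative', 'sum'),
             Count=('Max', 'size'))
)
channel_summary['Positive_Ratio'] = channel_summary['Positive'] / channel_summary['Count']

# One row per (Region, ROI), one boolean column per channel.
flags = demo.pivot(index=['Region', 'ROI'], columns='Channel', values='Max') > 0

# "C1C2" counts ROIs where both channels are positive;
# "C1C2-Only" additionally requires C3 to be negative.
c1c2 = (flags['C1'] & flags['C2']).groupby(level='Region').sum()
c1c2_only = (flags['C1'] & flags['C2'] & ~flags['C3']).groupby(level='Region').sum()
negative = (~flags.any(axis=1)).groupby(level='Region').sum()

print(channel_summary)
print(c1c2['A'], c1c2_only['A'], negative['A'])
```

The difference between the plain and the "-Only" columns is the extra "and the remaining channel is negative" condition, which is exactly what the long `and` chains in the loop encode.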

Just about done! All we need to do now is write out our dataframes to a new xlsx document.  We manually set our column sizes to keep things simple & readable.  
We write out four sheets: a summary sheet containing the first row of each group that reached the threshold, the region & channel tables, and the full data with highlighted styling.

In [None]:
with pd.ExcelWriter(f"{subject}.xlsx", engine='xlsxwriter') as writer:
    first_threshold_table.to_excel(writer, sheet_name=f'{subject}-summary')
    df_region.to_excel(writer, sheet_name=f'{subject}-regions')
    df_channel.to_excel(writer, sheet_name=f'{subject}-channels')
    formatted_data.to_excel(writer, sheet_name=subject, index=True)
    
    #Set a fixed width on every column of every sheet
    for worksheet in writer.sheets.values():
        worksheet.set_column(0, 29, 25)

    #Apply the percentage format to the ratio columns (D, F, ... AD) of the regions sheet.
    #This runs after the width pass, and repeats the width, so one set_column call
    #does not undo the other.
    workbook = writer.book
    region_worksheet = writer.sheets[f'{subject}-regions']
    pct_format = workbook.add_format({'num_format': '0.00%'})
    for col in range(3, 30, 2):
        region_worksheet.set_column(col, col, 25, pct_format)
print(f"Successfully wrote {subject}.xlsx")
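
In case you're wondering which spreadsheet columns hold the ratios: on the regions sheet, column A is the `Region` index, column B is `Count`, and after that every `..._Positive` column is followed by its ratio column. The helper below, `col_letter`, is a made-up name for this sketch; it converts zero-based column numbers to Excel letters so we can double check.

```python
def col_letter(index):
    """Convert a zero-based column index to an Excel column name (0 -> 'A')."""
    name = ''
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        name = chr(ord('A') + remainder) + name
    return name

# Every second column, starting at index 3, is a ratio column.
ratio_columns = [col_letter(i) for i in range(3, 30, 2)]
print(ratio_columns)
```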

## Congrats! We're done, and you should now have a new xlsx file based on the subject name.
If this is the first time you've run the notebook, you should see a "Mockup.xlsx" file that you can right click and download

## Bonus
Graphing your data is easy!

In [None]:
# Install the plotly & jupyter widgets packages
if showoutput:
    !{sys.executable} -m pip install plotly ipywidgets
else:
    !{sys.executable} -m pip install plotly ipywidgets > /dev/null
import plotly.express as px

Draw an example chart using our original dataframe object

In [None]:
if showcharts:
    #Iterating a groupby yields (key, group) pairs, so we only need to group once
    for key, group in data.groupby(['Region','Channel']):
        df = group.reset_index().set_index('ROI')
        fig = px.line(df, x=df.index, y=['Mean','Min','Area'], title=' - '.join(map(str, key)))
        fig.show()
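
The chart loop relies on the fact that iterating over a groupby gives us one `(key, group)` pair per Region & Channel combination. Here is that part on its own, without plotly, as a sketch on made-up rows:

```python
import pandas as pd

demo = pd.DataFrame({
    'Region': ['A', 'A', 'B'],
    'Channel': ['C1', 'C1', 'C1'],
    'ROI': [1, 2, 1],
    'Mean': [12.5, 14.0, 9.0],
}).set_index(['Region', 'Channel'])

titles = []
sizes = []
for key, group in demo.groupby(['Region', 'Channel']):
    # key is a (Region, Channel) tuple; group holds only that combination's rows
    titles.append(' - '.join(map(str, key)))
    sizes.append(len(group))

print(titles, sizes)
```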