This is an example of badly structured .xlsx data. It doesn't seem to be corrupted, incomplete, or irregularly named; it is just too sparse. It may be the automatic output of a data pipeline that exports the data in this shape.

# Reading the table

In [7]:
import pandas as pd

# first, reading the dataset from .xlsx file
df = pd.read_excel('datasets/1.-Badly-Structured-Sales-Data-1.xlsx')
df

Unnamed: 0,Segment>>,Consumer,Unnamed: 2,Unnamed: 3,Unnamed: 4,Consumer Total,Corporate,Unnamed: 7,Unnamed: 8,Unnamed: 9,Corporate Total,Home Office,Unnamed: 12,Unnamed: 13,Unnamed: 14,Home Office Total
0,Ship Mode>>,First Class,Same Day,Second Class,Standard Class,,First Class,Same Day,Second Class,Standard Class,,First Class,Same Day,Second Class,Standard Class,
1,Order ID,,,,,,,,,,,,,,,
2,CA-2011-100293,,,,,,,,,,,,,,91.056,91.0560
3,CA-2011-100706,,,129.44,,129.440,,,,,,,,,,
4,CA-2011-100895,,,,605.47,605.470,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
820,US-2014-166611,,,,,,,,,68.742,68.7420,,,,,
821,US-2014-167920,,,1827.51,,1827.510,,,,,,,,,,
822,US-2014-168116,,,,,,,8167.42,,,8167.4200,,,,,
823,US-2014-168690,,,,2.808,2.808,,,,,,,,,,


There are a lot of NaNs.
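
To put a number on "a lot", the share of missing cells can be measured directly. The sketch below uses a small made-up frame (the order IDs and amounts are illustrative, not taken from the file) with the same shape of problem: one value per row, NaN everywhere else.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the wide sales table (hypothetical IDs and amounts):
# every row holds exactly one value, the remaining cells are NaN.
toy = pd.DataFrame(
    {
        "First Class": [10.0, np.nan, np.nan],
        "Same Day": [np.nan, 20.0, np.nan],
        "Standard Class": [np.nan, np.nan, 30.0],
    },
    index=["A-1", "A-2", "A-3"],
)

# Fraction of missing cells in the whole table
nan_share = toy.isna().to_numpy().mean()

# Number of non-missing cells in each row
values_per_row = toy.notna().sum(axis=1)

print(f"Share of NaN cells: {nan_share:.2f}")
print(values_per_row)
```

Running the same two expressions on the real df would show how sparse the original layout is.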

Because pandas knows nothing about this dataset, it reads the table with a single-level header. However, the first row of the output shows that the table was meant to have a two-level (MultiIndex) header: Segment on top and Ship Mode below it.
Therefore, it's better to tell pandas that rows 0 and 1 are headers and that the first column (index_col=0) is the index, since all the order IDs are unique.

In [8]:
df = pd.read_excel('datasets/1.-Badly-Structured-Sales-Data-1.xlsx', header=[0, 1], index_col=0)
df

Segment>>,Consumer,Consumer,Consumer,Consumer,Consumer Total,Corporate,Corporate,Corporate,Corporate,Corporate Total,Home Office,Home Office,Home Office,Home Office,Home Office Total
Ship Mode>>,First Class,Same Day,Second Class,Standard Class,Unnamed: 5_level_1,First Class,Same Day,Second Class,Standard Class,Unnamed: 10_level_1,First Class,Same Day,Second Class,Standard Class,Unnamed: 15_level_1
Order ID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2,Unnamed: 13_level_2,Unnamed: 14_level_2,Unnamed: 15_level_2
CA-2011-100293,,,,,,,,,,,,,,91.0560,91.0560
CA-2011-100706,,,129.4400,,129.440,,,,,,,,,,
CA-2011-100895,,,,605.4700,605.470,,,,,,,,,,
CA-2011-100916,,,,,,,,,788.8600,788.8600,,,,,
CA-2011-101266,,,13.3600,,13.360,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
US-2014-166611,,,,,,,,,68.7420,68.7420,,,,,
US-2014-167920,,,1827.5100,,1827.510,,,,,,,,,,
US-2014-168116,,,,,,,8167.420,,,8167.4200,,,,,
US-2014-168690,,,,2.8080,2.808,,,,,,,,,,
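
With a two-level header in place, whole segments or single ship modes can be selected by name. The following is a minimal sketch on a made-up frame that mimics the structure above; the order IDs and amounts are hypothetical.

```python
import numpy as np
import pandas as pd

# Two-level column header: Segment on top, Ship Mode below (toy data)
columns = pd.MultiIndex.from_product(
    [["Consumer", "Corporate"], ["First Class", "Same Day"]],
    names=["Segment>>", "Ship Mode>>"],
)
toy = pd.DataFrame(
    [[10.0, np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 20.0]],
    index=pd.Index(["A-1", "A-2"], name="Order ID"),
    columns=columns,
)

# All ship modes of one segment (the top level is dropped from the result)
consumer = toy["Consumer"]

# A single cell, addressed by row label and a (Segment, Ship Mode) tuple
one_cell = toy.loc["A-1", ("Consumer", "First Class")]

# One ship mode across all segments
same_day = toy.xs("Same Day", axis=1, level="Ship Mode>>")

print(consumer)
print(one_cell)
print(same_day)
```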


Next, we do not need the Total columns: each order has a value in only one Ship Mode column, so a segment's Total merely repeats that value. We will drop them in the next step.
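
Before dropping a Total column, its redundancy can be checked by comparing it with the row-wise sum of the segment's ship-mode columns. This is a sketch on made-up data; the column label 'Unnamed: 2_level_1' only imitates the placeholder names pandas generates and is an assumption here.

```python
import numpy as np
import pandas as pd

# Toy frame with one segment and its Total column (hypothetical values)
columns = pd.MultiIndex.from_tuples([
    ("Consumer", "First Class"),
    ("Consumer", "Same Day"),
    ("Consumer Total", "Unnamed: 2_level_1"),
])
toy = pd.DataFrame(
    [[10.0, np.nan, 10.0],
     [np.nan, 20.0, 20.0]],
    index=["A-1", "A-2"],
    columns=columns,
)

# Row-wise sum over the segment's ship modes (NaNs are skipped by default)
row_sums = toy["Consumer"].sum(axis=1)

# The Total column as a plain Series
totals = toy["Consumer Total"].iloc[:, 0]

# True when the Total column carries no extra information
totals_are_redundant = bool(np.allclose(row_sums, totals))
print(totals_are_redundant)
```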

# Cleaning the table from unnecessary columns & rows

In [9]:
# Dropping the useless columns:
df.drop(['Consumer Total', 'Corporate Total', 'Home Office Total'], axis=1, inplace=True)
df

Segment>>,Consumer,Consumer,Consumer,Consumer,Corporate,Corporate,Corporate,Corporate,Home Office,Home Office,Home Office,Home Office
Ship Mode>>,First Class,Same Day,Second Class,Standard Class,First Class,Same Day,Second Class,Standard Class,First Class,Same Day,Second Class,Standard Class
Order ID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
CA-2011-100293,,,,,,,,,,,,91.0560
CA-2011-100706,,,129.4400,,,,,,,,,
CA-2011-100895,,,,605.4700,,,,,,,,
CA-2011-100916,,,,,,,,788.8600,,,,
CA-2011-101266,,,13.3600,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
US-2014-166611,,,,,,,,68.7420,,,,
US-2014-167920,,,1827.5100,,,,,,,,,
US-2014-168116,,,,,,8167.420,,,,,,
US-2014-168690,,,,2.8080,,,,,,,,


Another thing: Grand Total is not a valid Order ID; it is the aggregate sum of all sales for each Ship Mode. Let's exclude it from our data by slicing the last row off.

In [10]:
# Dropping the last row (Grand Total) by position instead of a hard-coded row count
df = df.iloc[:-1]
df

Segment>>,Consumer,Consumer,Consumer,Consumer,Corporate,Corporate,Corporate,Corporate,Home Office,Home Office,Home Office,Home Office
Ship Mode>>,First Class,Same Day,Second Class,Standard Class,First Class,Same Day,Second Class,Standard Class,First Class,Same Day,Second Class,Standard Class
Order ID,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
CA-2011-100293,,,,,,,,,,,,91.056
CA-2011-100706,,,129.44,,,,,,,,,
CA-2011-100895,,,,605.470,,,,,,,,
CA-2011-100916,,,,,,,,788.860,,,,
CA-2011-101266,,,13.36,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
US-2014-166233,,,,24.000,,,,,,,,
US-2014-166611,,,,,,,,68.742,,,,
US-2014-167920,,,1827.51,,,,,,,,,
US-2014-168116,,,,,,8167.42,,,,,,
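
Positional slicing relies on the aggregate row being last. A label-based drop is a possible alternative; the sketch below assumes the row's index label is exactly 'Grand Total' and uses made-up data.

```python
import pandas as pd

# Toy frame whose last row is an aggregate, not an order (hypothetical values)
toy = pd.DataFrame(
    {"Sales": [10.0, 20.0, 30.0]},
    index=pd.Index(["A-1", "A-2", "Grand Total"], name="Order ID"),
)

# Drop the aggregate row by its label; errors="ignore" makes the call
# a no-op if the label is absent, so the cell can be re-run safely.
by_label = toy.drop(index="Grand Total", errors="ignore")

# Running it a second time leaves the result unchanged
by_label_again = by_label.drop(index="Grand Total", errors="ignore")

print(by_label)
```

Unlike a slice, this does not silently remove a real order if the cell is executed twice.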


# Consolidating the table

Finally, we convert this multiindexed table into a single-indexed one.
In essence, the MultiIndex header acts like a groupby view that spreads the sales amounts over an extra dimension. In our case,
the data is grouped by Segment and Ship Mode, but since each order belongs to exactly one Segment and Ship Mode combination, this wide layout is redundant and therefore creates
a lot of NaNs.

Segment and Ship Mode should become ordinary columns, and the numbers should go into a 'Sales' column.

In [11]:
# Unstacking the table into a Series indexed by (Segment, Ship Mode, Order ID)
ser = df.unstack()
# Dropping the NaNs to keep just the numeric values
cleaned_df = ser.dropna()
cleaned_df

Segment>>    Ship Mode>>     Order ID      
Consumer     First Class     CA-2011-103366    149.950
                             CA-2011-109043    243.600
                             CA-2011-113166      9.568
                             CA-2011-124023      8.960
                             CA-2011-130155     34.200
                                                ...   
Home Office  Standard Class  US-2014-129224      4.608
                             US-2014-132031    513.496
                             US-2014-132297    598.310
                             US-2014-132675    148.160
                             US-2014-156083      9.664
Length: 822, dtype: float64
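
The effect of unstack() followed by dropna() is easier to see on a tiny input. This is a worked miniature with made-up order IDs and amounts, not the real data.

```python
import numpy as np
import pandas as pd

# 2 orders x 4 (Segment, Ship Mode) columns, one value per order (toy data)
columns = pd.MultiIndex.from_product(
    [["Consumer", "Corporate"], ["First Class", "Same Day"]],
    names=["Segment>>", "Ship Mode>>"],
)
toy = pd.DataFrame(
    [[10.0, np.nan, np.nan, np.nan],
     [np.nan, np.nan, np.nan, 20.0]],
    index=pd.Index(["A-1", "A-2"], name="Order ID"),
    columns=columns,
)

# Every cell becomes one entry of a Series; the column levels come first
# in the new index, followed by the original row index.
ser = toy.unstack()

# Keep only the cells that actually hold a value
cleaned = ser.dropna()

print(len(ser), len(cleaned))
print(cleaned)
```

The 8 cells of the wide table collapse to 2 entries, one per order, which mirrors how the real table shrinks to 822 values.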

In [12]:
# Naming the column
final_df = cleaned_df.to_frame(name='Sales')
final_df

Segment>>,Ship Mode>>,Order ID,Sales
Consumer,First Class,CA-2011-103366,149.950
Consumer,First Class,CA-2011-109043,243.600
Consumer,First Class,CA-2011-113166,9.568
Consumer,First Class,CA-2011-124023,8.960
Consumer,First Class,CA-2011-130155,34.200
...,...,...,...
Home Office,Standard Class,US-2014-129224,4.608
Home Office,Standard Class,US-2014-132031,513.496
Home Office,Standard Class,US-2014-132297,598.310
Home Office,Standard Class,US-2014-132675,148.160


In [13]:
# Converting Multiindex into columns:
final_df = final_df.reset_index()

# Renaming the columns to more conventional names
# (rename() is used because assigning into columns.values in place is not reliable):
final_df = final_df.rename(columns={'Segment>>': 'Segment',
                                    'Ship Mode>>': 'Ship_Mode',
                                    'Order ID': 'Order_ID'})

# Rearranging column order:
final_df = final_df[['Order_ID', 'Segment', 'Ship_Mode', 'Sales']]
final_df


Unnamed: 0,Order_ID,Segment,Ship_Mode,Sales
0,CA-2011-103366,Consumer,First Class,149.950
1,CA-2011-109043,Consumer,First Class,243.600
2,CA-2011-113166,Consumer,First Class,9.568
3,CA-2011-124023,Consumer,First Class,8.960
4,CA-2011-130155,Consumer,First Class,34.200
...,...,...,...,...
817,US-2014-129224,Home Office,Standard Class,4.608
818,US-2014-132031,Home Office,Standard Class,513.496
819,US-2014-132297,Home Office,Standard Class,598.310
820,US-2014-132675,Home Office,Standard Class,148.160
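
A couple of quick checks can confirm that the tidy table has the expected properties: one row per order and no missing values. The sketch below runs them on a made-up frame with the same column names; the rows are hypothetical.

```python
import pandas as pd

# Toy tidy table with the same columns as final_df (hypothetical rows)
tidy = pd.DataFrame({
    "Order_ID": ["A-1", "A-2", "A-3"],
    "Segment": ["Consumer", "Corporate", "Home Office"],
    "Ship_Mode": ["First Class", "Same Day", "Standard Class"],
    "Sales": [10.0, 20.0, 30.0],
})

# Each order should appear exactly once
ids_unique = tidy["Order_ID"].is_unique

# No NaNs should be left after the reshaping
no_missing = bool(tidy.notna().all().all())

print(ids_unique, no_missing)
```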


We don't know anything about the Sales units. The values might be in currency units, but they might also be in thousands or even millions.
Therefore, the values are left at their original precision rather than rounded.

What we can do is calculate the aggregate sums, as in the original table, grouping by Segment:

# Calculating the aggregates

In [15]:
# Summing only the Sales column, so the text columns are not aggregated
print(final_df.groupby('Segment')[['Sales']].sum())
print('Grand Total:', final_df.Sales.sum())

                   Sales
Segment                 
Consumer     195580.9710
Corporate    121885.9325
Home Office   74255.0015
Grand Total: 391721.905
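
The tidy layout also makes it straightforward to rebuild a compact Segment by Ship Mode summary, including totals. The sketch below uses made-up rows for the pivot and then checks that the three segment sums printed above add up to the printed grand total.

```python
import math

import pandas as pd

# Toy tidy table (hypothetical rows)
tidy = pd.DataFrame({
    "Order_ID": ["A-1", "A-2", "A-3", "A-4"],
    "Segment": ["Consumer", "Consumer", "Corporate", "Corporate"],
    "Ship_Mode": ["First Class", "Same Day", "First Class", "First Class"],
    "Sales": [10.0, 20.0, 30.0, 40.0],
})

# Per-segment sums, as in the cell above
by_segment = tidy.groupby("Segment")["Sales"].sum()

# Segment x Ship Mode summary with row and column totals
wide = tidy.pivot_table(
    index="Segment",
    columns="Ship_Mode",
    values="Sales",
    aggfunc="sum",
    fill_value=0,
    margins=True,
    margins_name="Grand Total",
)
print(wide)

# Consistency check on the figures reported above
reported = [195580.9710, 121885.9325, 74255.0015]
totals_match = math.isclose(sum(reported), 391721.905)
print(totals_match)
```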


This concludes part 1.