# Pandas Series
A Series is a one-dimensional (1D) labeled array capable of holding any data type: int, str, float, Python objects, etc.

### Creating Series using Python list or dict

In [5]:
import pandas as pd
import numpy as np

s = pd.Series([1,2,3,4,5])
s

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [6]:
s = pd.Series(['Xtern', 'abc', '123'])
s

0    Xtern
1      abc
2      123
dtype: object

In [8]:
s = pd.Series({'a': 1, 'b': 2, 'c': 3})
s

a    1
b    2
c    3
dtype: int64
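
When a dict is passed together with an explicit `index`, pandas keeps only the requested labels and fills any label missing from the dict with NaN. A minimal sketch, where the dict contents and the names `prices` and `s_sel` are made up for illustration:

```python
import pandas as pd

prices = {'a': 1, 'b': 2, 'c': 3}

# Only labels listed in `index` are kept; 'z' is not a key of the dict,
# so its value becomes NaN and the dtype is upcast to float64.
s_sel = pd.Series(prices, index=['a', 'c', 'z'])
print(s_sel)
```

The upcast to float happens because NaN is a floating-point value and the default integer dtype cannot represent it.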

### Creating Series from numpy ndarray

In [11]:
data = np.array([10, 40, 45, 67, 78])

series = pd.Series(data)
series

0    10
1    40
2    45
3    67
4    78
dtype: Int64

In [31]:
data = np.random.randint(10, 50, size = (2,3))

series = pd.Series(data)
series

ValueError: Data must be 1-dimensional, got ndarray of shape (2, 3) instead

#### We get a ValueError because we are trying to pass a 2D array to a Series, which is a 1D data structure in pandas
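
If the 2D values really are needed, two common options are to flatten the array first or to use a DataFrame instead. The sketch below uses a small made-up array (`grid`) rather than the random data above:

```python
import numpy as np
import pandas as pd

grid = np.arange(6).reshape(2, 3)   # a 2D array of shape (2, 3)

# ravel() flattens row by row, giving a 1D array that a Series accepts.
flat_series = pd.Series(grid.ravel())
print(flat_series.shape)

# Alternatively, keep the 2D layout by using a DataFrame.
grid_df = pd.DataFrame(grid)
print(grid_df.shape)
```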

### Creating Series from scalar

In [13]:
index = ['a', 'b', 'c', 'd']

pd.Series(5.5, index = index)

a    5.5
b    5.5
c    5.5
d    5.5
dtype: float64

### Accessing properties/attributes and methods of Series

In [14]:
data = np.array([10, 23, 45, 67, 89])

series = pd.Series(data)
series

0    10
1    23
2    45
3    67
4    89
dtype: Int64

#### Attributes

In [15]:
print('Data type:', series.dtype, '\n')
print('Shape:', series.shape, '\n')
print('Values:', series.values, '\n')
print('Array:', series.array, '\n')

Data type: Int64 

Shape: (5,) 

Values: <IntegerArray>
[10, 23, 45, 67, 89]
Length: 5, dtype: Int64 

Array: <IntegerArray>
[10, 23, 45, 67, 89]
Length: 5, dtype: Int64 



#### Methods

In [16]:
# to extract back to numpy array

to_numpy = series.to_numpy()
to_numpy

array([10, 23, 45, 67, 89])

In [17]:
series.head()

0    10
1    23
2    45
3    67
4    89
dtype: Int64

In [18]:
series.tail()

0    10
1    23
2    45
3    67
4    89
dtype: Int64

In [20]:
series.info()

<class 'pandas.core.series.Series'>
RangeIndex: 5 entries, 0 to 4
Series name: None
Non-Null Count  Dtype
--------------  -----
5 non-null      Int64
dtypes: Int64(1)
memory usage: 173.0 bytes


### Accessing Data Using Indexing and Slicing

In [21]:
s = pd.Series([1, 2, 3, 4, 5, 6])

print(s[2])

3


In [22]:
print(s[1:])

print(s[1:4])

1    2
2    3
3    4
4    5
5    6
dtype: int64
1    2
2    3
3    4
dtype: int64


In [23]:
# to retrieve multiple elements, pass a list of index labels

print(s[[2, 4]])

2    3
4    5
dtype: int64


In [24]:
index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']

s = pd.Series([1, 2, 3, 4, 5, 6, 7], index = index)
s

a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64

In [25]:
print(s['a'])

1


In [26]:
# to retrieve multiple elements, pass a list of index labels

print(s[['a', 'c', 'e']])

a    1
c    3
e    5
dtype: int64


In [27]:
print(s['h'])

KeyError: 'h'

The label 'h' is not in the index, which is why we get a KeyError. With Series.get(), a missing label returns None, or a default value you specify, instead of raising an error.

In [28]:
print(s.get('h'))

None


In [32]:
# OR

print(s.get('h', np.nan))

nan


## Uses of Pandas Series
### Store One-Dimensional Data:

- It’s great for storing a single column of data like a list of names, numbers, dates, etc.

### Labeling Data with Index:

- You can assign custom labels (index) to each value. This is helpful for referencing or slicing data by name rather than position.

### Data Manipulation:

- Easy to apply operations (like filtering, arithmetic, etc.) across all elements.

- Built-in vectorized operations, making it more efficient than regular Python lists.
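
As a rough illustration of vectorized operations and filtering, using made-up temperature values:

```python
import pandas as pd

temps_c = pd.Series([12.0, 18.5, 25.0, 31.5])

# Arithmetic is applied element-wise without an explicit loop.
temps_f = temps_c * 9 / 5 + 32

# A comparison yields a boolean Series that can be used as a filter.
warm = temps_c[temps_c > 20]

print(temps_f.tolist())
print(warm.tolist())
```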

### Missing Data Handling:

- Can handle NaN (Not a Number) values gracefully, useful for real-world data.
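
A small sketch of what this looks like in practice (the values are invented for the example):

```python
import numpy as np
import pandas as pd

ratings = pd.Series([3.4, 5.0, np.nan, 1.0])

n_missing = ratings.isna().sum()   # number of missing values
avg = ratings.mean()               # NaN is skipped by default (skipna=True)

# fillna returns a new Series; the original is left unchanged.
filled = ratings.fillna(0.0)

print(n_missing, avg)
print(filled.tolist())
```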

### Time Series Data:

- Ideal for indexing and working with time series data.

### Intermediate Step in Data Analysis:

- Often used when working with or extracting columns from a DataFrame to perform operations before assigning results back.

## When to Use Pandas Series
- When you're working with one-dimensional data (like a list or a column from a DataFrame).

- When you need label-based indexing for elements.

- When you want to perform vectorized operations on a 1D dataset.

- When you're preparing or cleaning a single column of a larger dataset.

- When building or analyzing features for machine learning one column at a time.

- When handling time-indexed data (like stock prices or logs).

# Pandas DataFrame

- It is a 2D labeled data structure
- It is value-mutable and size-mutable
- A tabular structure with heterogeneously typed columns
- In pandas a data table is called a DataFrame whereas each column is called a Series

### Syntax for creating a Pandas DataFrame

In [33]:
df = pd.DataFrame(data, index=idx, columns=cols)

NameError: name 'idx' is not defined

Data can be many different things, e.g., a Python dict, list, or tuple.
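
The syntax line above raises a NameError only because `idx` and `cols` were never defined. A runnable version, with hypothetical placeholder values for all three arguments, might look like this:

```python
import pandas as pd

# Hypothetical placeholder values for the three arguments.
data = [[1, 'x'], [2, 'y']]      # row-wise values
idx = ['r1', 'r2']               # row labels
cols = ['num', 'letter']         # column labels

df_demo = pd.DataFrame(data, index=idx, columns=cols)
print(df_demo)
```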

### Creating a dataframe using python dictionary

In [34]:
data = {
    'Name': ['Ann', 'Jane', 'Xavier', 'Justina'],
    'Age': [40, 46, 58, 67],
    'Gender': ["Female", "Female", "Male", "Female"]
}

df = pd.DataFrame(data)

df

Unnamed: 0,Name,Age,Gender
0,Ann,40,Female
1,Jane,46,Female
2,Xavier,58,Male
3,Justina,67,Female


### Creating a df using tuple or lists

### Tuples in a list

In [39]:
data = [
   ('1/1/2019', 23, 45, 'Rain'),
   ('2/3/2019', 45, 12, 'Fog'),
   ('1/4/2019', 11, 34, 'Bright'),
   ('3/4/2019', 34, 56, 'Storm')
]

df = pd.DataFrame(data)

df

Unnamed: 0,0,1,2,3
0,1/1/2019,23,45,Rain
1,2/3/2019,45,12,Fog
2,1/4/2019,11,34,Bright
3,3/4/2019,34,56,Storm


### Creating a df using tuple or lists and indicating column names

### Tuples in a tuple

In [38]:

data = (
   ('1/1/2019', 23, 45, 'Rain'),
   ('2/3/2019', 45, 12, 'Fog'),
   ('1/4/2019', 11, 34, 'Bright'),
   ('3/4/2019', 34, 56, 'Storm')
)

df = pd.DataFrame(data, columns=['Date', 'Temperature', 'Windspeed', 'Event'])

df

Unnamed: 0,Date,Temperature,Windspeed,Event
0,1/1/2019,23,45,Rain
1,2/3/2019,45,12,Fog
2,1/4/2019,11,34,Bright
3,3/4/2019,34,56,Storm


### Creating a df using tuple or lists, indicating indexes and column names

### Lists in a Tuple

In [37]:
data = (
    ['1/1/2019', 23, 45, 'Rain'],
    ['2/3/2019', 45, 12, 'Fog'],
    ['1/4/2019', 11, 34, 'Bright'],
    ['3/4/2019', 34, 56, 'Storm']
)

df = pd.DataFrame(data, index=['T1', 'T2', 'T3', 'T4'], columns=['Date', 'Temperature', 'Windspeed', 'Event'])

df

Unnamed: 0,Date,Temperature,Windspeed,Event
T1,1/1/2019,23,45,Rain
T2,2/3/2019,45,12,Fog
T3,1/4/2019,11,34,Bright
T4,3/4/2019,34,56,Storm


### Creating dataframe using numpy array

In [40]:
arr = np.random.randint(100, 201, size=(1000, 100))
arr

array([[179, 105, 111, ..., 154, 174, 101],
       [114, 127, 197, ..., 136, 135, 184],
       [194, 198, 167, ..., 197, 138, 102],
       ...,
       [159, 105, 166, ..., 193, 125, 195],
       [155, 180, 142, ..., 158, 183, 136],
       [148, 175, 159, ..., 126, 139, 137]], dtype=int32)

### Converting to a dataframe and adding column names to the dataframe

In [42]:
df = pd.DataFrame(arr, columns=['col_'+str(i) for i in range(1, 101)])

df

Unnamed: 0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,...,col_91,col_92,col_93,col_94,col_95,col_96,col_97,col_98,col_99,col_100
0,179,105,111,123,152,170,192,155,129,200,...,104,125,168,145,164,111,196,154,174,101
1,114,127,197,119,118,156,182,172,117,176,...,152,154,198,143,108,143,113,136,135,184
2,194,198,167,142,144,152,180,137,147,105,...,186,175,133,131,140,172,175,197,138,102
3,123,119,191,142,157,119,129,160,124,195,...,107,138,191,149,182,158,147,165,189,160
4,156,196,178,167,113,103,146,103,177,200,...,167,152,160,145,150,168,140,139,101,116
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,127,108,187,149,149,101,140,108,184,182,...,135,119,180,144,183,130,118,139,134,110
996,112,193,105,151,195,126,165,133,142,162,...,199,116,111,139,134,191,132,121,112,178
997,159,105,166,195,180,121,110,169,121,119,...,141,197,168,145,104,165,129,193,125,195
998,155,180,142,157,179,167,117,122,198,171,...,174,127,145,179,158,194,138,158,183,136


### Accessing Attributes/Properties and Methods of DataFrame

In [43]:
# Creating dictionary of series
data = {
   'Name': pd.Series(['Neymar', 'Muse', 'Abu', 'Shalom', 'Jacky']),
   'Age': pd.Series([34, 56, 78, 23, 12]),
   'Rating': pd.Series([3.4, 5.0, 2.3, 1.0, np.nan])
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Rating
0,Neymar,34,3.4
1,Muse,56,5.0
2,Abu,78,2.3
3,Shalom,23,1.0
4,Jacky,12,


In [46]:
# Attributes

print('Shape of the df: ', df.shape, '\n')
print('Names of the columns: ', df.columns, '\n')
print('Data types for each column: \n', df.dtypes, '\n')
print('Axes: \n', df.axes, '\n')
print('Returning data as a numpy array: \n', df.values, '\n')

Shape of the df:  (5, 3) 

Names of the columns:  Index(['Name', 'Age', 'Rating'], dtype='object') 

Data types for each column: 
 Name       object
Age         int64
Rating    float64
dtype: object 

Axes: 
 [RangeIndex(start=0, stop=5, step=1), Index(['Name', 'Age', 'Rating'], dtype='object')] 

Returning data as a numpy array: 
 [['Neymar' 34 3.4]
 ['Muse' 56 5.0]
 ['Abu' 78 2.3]
 ['Shalom' 23 1.0]
 ['Jacky' 12 nan]] 



In [47]:
# for more technical information

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    5 non-null      object 
 1   Age     5 non-null      int64  
 2   Rating  4 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes


This method provides information about the following:
- The kind of data structure
- The number of entries/rows
- The row labels (index), ranging from 0 to 4 in this case
- The number of columns and the count of non-null values in each column
- The number of columns of each data type
- The approximate amount of RAM used to hold the dataframe.

In [48]:
df.head()

Unnamed: 0,Name,Age,Rating
0,Neymar,34,3.4
1,Muse,56,5.0
2,Abu,78,2.3
3,Shalom,23,1.0
4,Jacky,12,


In [50]:
df.tail(2)

Unnamed: 0,Name,Age,Rating
3,Shalom,23,1.0
4,Jacky,12,


### Working With Tabular Data

Pandas integrates with many file formats and data sources, such as CSV, Excel, SQL, JSON, and Parquet. To load data into pandas, use the functions prefixed with read_ (for example, read_csv). To export data from pandas, use the methods prefixed with to_ (for example, to_csv).
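
The read/write pairing can be tried without touching the disk. The sketch below is only an illustration: it uses an in-memory `io.StringIO` buffer in place of a real file, and a two-row frame invented for the example.

```python
import io
import pandas as pd

df_out = pd.DataFrame({'Name': ['Ann', 'Jane'], 'Age': [40, 46]})

# to_csv returns the CSV text when no path is given.
csv_text = df_out.to_csv(index=False)

# read_csv accepts any file-like object, so StringIO stands in for a file.
df_in = pd.read_csv(io.StringIO(csv_text))
print(df_in)
```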

In [52]:
data = {
        'Name': pd.Series(['Neymar', 'Muse', 'Abu', 'Shalom', 'Jacky']),
        'Age': pd.Series([34, 56, 78,23,12]),
        'Rating': pd.Series([3.4, 5.0, 2.3, 1.0, np.nan])
}

df = pd.DataFrame(data)
df

Unnamed: 0,Name,Age,Rating
0,Neymar,34,3.4
1,Muse,56,5.0
2,Abu,78,2.3
3,Shalom,23,1.0
4,Jacky,12,


In [53]:
# Writing dataframe to csv

df.to_csv('csv1.csv')

In [54]:
# Writing dataframe to csv without index

df.to_csv('csv2.csv', index=False)

In [58]:
# Writing dataframe to xlsx

df.to_excel('excel1.xl

ModuleNotFoundError: No module named 'openpyxl'

In [56]:
!pip install openpyxl

Defaulting to user installation because normal site-packages is not writeable


In [57]:
help(df.to_excel)

Help on method to_excel in module pandas.core.generic:

to_excel(excel_writer: 'FilePath | WriteExcelBuffer | ExcelWriter', *, sheet_name: 'str' = 'Sheet1', na_rep: 'str' = '', float_format: 'str | None' = None, columns: 'Sequence[Hashable] | None' = None, header: 'Sequence[Hashable] | bool_t' = True, index: 'bool_t' = True, index_label: 'IndexLabel | None' = None, startrow: 'int' = 0, startcol: 'int' = 0, engine: "Literal['openpyxl', 'xlsxwriter'] | None" = None, merge_cells: 'bool_t' = True, inf_rep: 'str' = 'inf', freeze_panes: 'tuple[int, int] | None' = None, storage_options: 'StorageOptions | None' = None, engine_kwargs: 'dict[str, Any] | None' = None) -> 'None' method of pandas.core.frame.DataFrame instance
    Write object to an Excel sheet.
    
    To write a single object to an Excel .xlsx file it is only necessary to
    specify a target file name. To write to multiple sheets it is necessary to
    create an `ExcelWriter` object with a target file name, and specify a sheet
 

In [59]:
# Writing dataframe to excel without index

df.to_excel('excel1.xlsx', sheet_name='stud_data', index=False)

ModuleNotFoundError: No module named 'openpyxl'

### Reading .xlsx file

In [60]:
import pandas as pd

df = pd.read_excel('excel2.xlsx')

FileNotFoundError: [Errno 2] No such file or directory: 'excel2.xlsx'

### Read .csv Files - Iris DataSet

In [63]:
import pandas as pd

df = pd.read_csv('Iris.csv')

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             150 non-null    int64  
 1   SepalLengthCm  150 non-null    float64
 2   SepalWidthCm   150 non-null    float64
 3   PetalLengthCm  150 non-null    float64
 4   PetalWidthCm   150 non-null    float64
 5   Species        150 non-null    object 
dtypes: float64(4), int64(1), object(1)
memory usage: 7.2+ KB


In [65]:
df.head()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa


In [66]:
df.tail(10)

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
140,141,6.7,3.1,5.6,2.4,Iris-virginica
141,142,6.9,3.1,5.1,2.3,Iris-virginica
142,143,5.8,2.7,5.1,1.9,Iris-virginica
143,144,6.8,3.2,5.9,2.3,Iris-virginica
144,145,6.7,3.3,5.7,2.5,Iris-virginica
145,146,6.7,3.0,5.2,2.3,Iris-virginica
146,147,6.3,2.5,5.0,1.9,Iris-virginica
147,148,6.5,3.0,5.2,2.0,Iris-virginica
148,149,6.2,3.4,5.4,2.3,Iris-virginica
149,150,5.9,3.0,5.1,1.8,Iris-virginica


In [67]:
df.sum()

Id                                                           11325
SepalLengthCm                                                876.5
SepalWidthCm                                                 458.1
PetalLengthCm                                                563.8
PetalWidthCm                                                 179.8
Species          Iris-setosaIris-setosaIris-setosaIris-setosaIr...
dtype: object

In [70]:
df.sum(axis=1)

TypeError: unsupported operand type(s) for +: 'float' and 'str'

The error occurs because summing across each row (axis=1) tries to add the numeric columns to the string-valued Species column. Let's fix this by summing only the numeric columns.

In [71]:
df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']].sum(axis=1)

0      10.2
1       9.5
2       9.4
3       9.4
4      10.2
       ... 
145    17.2
146    15.7
147    16.7
148    17.3
149    15.8
Length: 150, dtype: float64
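
An alternative to listing the columns by hand is the `numeric_only` parameter of `sum()`. The frame below is a small made-up stand-in for the Iris data, not the real file:

```python
import pandas as pd

# Small made-up frame with one text column, mirroring the Iris layout.
flowers = pd.DataFrame({
    'SepalLengthCm': [5.1, 4.9],
    'PetalLengthCm': [1.4, 1.4],
    'Species': ['Iris-setosa', 'Iris-setosa'],
})

# numeric_only=True skips the non-numeric column instead of raising.
row_totals = flowers.sum(axis=1, numeric_only=True)
print(row_totals)
```

Note that on the real Iris frame this would also include the integer `Id` column in each row total, which is usually not wanted; selecting the measurement columns explicitly, as done above, avoids that.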

Other aggregation methods include:
- min() and max()
- mean(), median(), var(), and std()
- count(), nunique(), unique(), and value_counts() for categorical columns

In [72]:
df.dtypes

Id                 int64
SepalLengthCm    float64
SepalWidthCm     float64
PetalLengthCm    float64
PetalWidthCm     float64
Species           object
dtype: object

### Creating a new dataframe consisting only of float datatype

In [73]:
num_cols = df.select_dtypes(include=['float64']).columns

num_cols

Index(['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm'], dtype='object')

In [75]:
# subsetting

df[num_cols].median()

SepalLengthCm    5.80
SepalWidthCm     3.00
PetalLengthCm    4.35
PetalWidthCm     1.30
dtype: float64

In [76]:
# subsetting

df[num_cols].std()

SepalLengthCm    0.828066
SepalWidthCm     0.433594
PetalLengthCm    1.764420
PetalWidthCm     0.763161
dtype: float64

In [79]:
df['Species'].count()

np.int64(150)

In [80]:
df['Species'].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: count, dtype: int64

print(df['Species'].unique())
print(df['Species'].nunique())

### Using describe() to Summarize the Data

In [82]:
# summarizes all the numerical columns

df.describe()

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [83]:
df.describe(include=['object'])

Unnamed: 0,Species
count,150
unique,3
top,Iris-setosa
freq,50


In [84]:
df.describe(include=['number'])

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
count,150.0,150.0,150.0,150.0,150.0
mean,75.5,5.843333,3.054,3.758667,1.198667
std,43.445368,0.828066,0.433594,1.76442,0.763161
min,1.0,4.3,2.0,1.0,0.1
25%,38.25,5.1,2.8,1.6,0.3
50%,75.5,5.8,3.0,4.35,1.3
75%,112.75,6.4,3.3,5.1,1.8
max,150.0,7.9,4.4,6.9,2.5


In [85]:
# for all the columns

df.describe(include='all')

Unnamed: 0,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
count,150.0,150.0,150.0,150.0,150.0,150
unique,,,,,,3
top,,,,,,Iris-setosa
freq,,,,,,50
mean,75.5,5.843333,3.054,3.758667,1.198667,
std,43.445368,0.828066,0.433594,1.76442,0.763161,
min,1.0,4.3,2.0,1.0,0.1,
25%,38.25,5.1,2.8,1.6,0.3,
50%,75.5,5.8,3.0,4.35,1.3,
75%,112.75,6.4,3.3,5.1,1.8,


In [86]:
df.corr()

ValueError: could not convert string to float: 'Iris-setosa'

In [87]:
df.skew()

TypeError: could not convert string to float: 'Iris-setosa'

In [88]:
df.kurt()

TypeError: could not convert string to float: 'Iris-setosa'

#### These errors tell us that we need to select only valid columns, i.e., numerical columns, in order to use these functions.

In [89]:
# Remember the num_cols we created earlier? Let's create something like that

num_cols = df.select_dtypes(include=['float64']).columns

In [90]:
df[num_cols].kurt()

SepalLengthCm   -0.552064
SepalWidthCm     0.290781
PetalLengthCm   -1.401921
PetalWidthCm    -1.339754
dtype: float64

In [91]:
df[num_cols].skew()

SepalLengthCm    0.314911
SepalWidthCm     0.334053
PetalLengthCm   -0.274464
PetalWidthCm    -0.104997
dtype: float64

In [92]:
df[num_cols].corr()

Unnamed: 0,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm
SepalLengthCm,1.0,-0.109369,0.871754,0.817954
SepalWidthCm,-0.109369,1.0,-0.420516,-0.356544
PetalLengthCm,0.871754,-0.420516,1.0,0.962757
PetalWidthCm,0.817954,-0.356544,0.962757,1.0


### DataFrame.agg() Method

In [93]:
df.columns

Index(['Id', 'SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm',
       'Species'],
      dtype='object')

In [95]:
df.agg(
    {
      'SepalLengthCm': ['min', 'max', 'count'],
      'PetalLengthCm': ['min', 'max', 'mean', 'count'],
      'Species': ['count']
   }
)

Unnamed: 0,SepalLengthCm,PetalLengthCm,Species
min,4.3,1.0,
max,7.9,6.9,
count,150.0,150.0,150.0
mean,,3.758667,


### Accessing Data in a DataFrame Using Indexing and Slicing in Pandas DataFrame

- When selecting subsets of data, square brackets [] are used.
- Inside these brackets you can use a single column/row label, a list of column/row labels, a slice of labels, a conditional expression, or a colon.


### Reading the .csv File - Weather DataSet

In [6]:
df = pd.read_csv('nyc_weather2.csv')

df

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
0,1-1-2016,42,34,38.0,0.00,0.0,0
1,2-1-2016,40,32,36.0,0.00,0.0,0
2,3-1-2016,45,35,40.0,0.00,0.0,0
3,4-1-2016,36,14,25.0,0.00,0.0,0
4,5-1-2016,29,11,20.0,0.00,0.0,0
...,...,...,...,...,...,...,...
361,27-12-2016,60,40,50.0,0,0,0
362,28-12-2016,40,34,37.0,0,0,0
363,29-12-2016,46,33,39.5,0.39,0,0
364,30-12-2016,40,33,36.5,0.01,T,0


In [7]:
print("Shape of dataset: ", df.shape)
print("Features: ", df.columns)

Shape of dataset:  (366, 7)
Features:  Index(['date', 'maximum temperature', 'minimum temperature',
       'average temperature', 'precipitation', 'snow fall', 'snow depth'],
      dtype='object')


In [8]:
df.dtypes

date                    object
maximum temperature      int64
minimum temperature      int64
average temperature    float64
precipitation           object
snow fall               object
snow depth              object
dtype: object

In [9]:
df.describe()

Unnamed: 0,maximum temperature,minimum temperature,average temperature
count,366.0,366.0,366.0
mean,64.625683,49.806011,57.215847
std,18.041787,16.570747,17.12476
min,15.0,-1.0,7.0
25%,50.0,37.25,44.0
50%,64.5,48.0,55.75
75%,81.0,65.0,73.5
max,96.0,81.0,88.5


In [11]:
df.count()

date                   366
maximum temperature    366
minimum temperature    366
average temperature    366
precipitation          366
snow fall              366
snow depth             366
dtype: int64

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 366 entries, 0 to 365
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   date                 366 non-null    object 
 1   maximum temperature  366 non-null    int64  
 2   minimum temperature  366 non-null    int64  
 3   average temperature  366 non-null    float64
 4   precipitation        366 non-null    object 
 5   snow fall            366 non-null    object 
 6   snow depth           366 non-null    object 
dtypes: float64(1), int64(2), object(4)
memory usage: 20.1+ KB


### Filtering Single vs Multiple Columns from a DataFrame

To select a single column, use square brackets [] with the name of the column of interest.

In [13]:
max_temp_df = df['maximum temperature']
max_temp_df

0      42
1      40
2      45
3      36
4      29
       ..
361    60
362    40
363    46
364    40
365    44
Name: maximum temperature, Length: 366, dtype: int64

In [15]:
print('Type: ', type(max_temp_df))
print('Shape: ', max_temp_df.shape)

Type:  <class 'pandas.core.series.Series'>
Shape:  (366,)


In [16]:
# the selection above returned a Series; use double brackets to get a DataFrame instead

max_temp_df = df[['maximum temperature']]

max_temp_df

Unnamed: 0,maximum temperature
0,42
1,40
2,45
3,36
4,29
...,...
361,60
362,40
363,46
364,40


In [17]:
# selecting multiple columns

max_temp_df = df[['maximum temperature', 'minimum temperature']]

max_temp_df

Unnamed: 0,maximum temperature,minimum temperature
0,42,34
1,40,32
2,45,35
3,36,14
4,29,11
...,...,...
361,60,40
362,40,34
363,46,33
364,40,33


### Filtering Rows From a DataFrame

There are several ways to do this:
- Selecting rows using slicing operation <br>
df[starting_row_index:ending_row_index:step]

- Boolean indexes. To select rows based on a conditional expression, use a condition inside the [] <br>
df[condition]

In [18]:
# Method 1: slicing

df[1:3]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
1,2-1-2016,40,32,36.0,0.0,0.0,0
2,3-1-2016,45,35,40.0,0.0,0.0,0


In [19]:
# Method 2: Boolean indexing

df['maximum temperature']>90

0      False
1      False
2      False
3      False
4      False
       ...  
361    False
362    False
363    False
364    False
365    False
Name: maximum temperature, Length: 366, dtype: bool

In [20]:
# Applying

df[df['maximum temperature']>90]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
148,28-5-2016,92,71,81.5,0.00,0.0,0
187,6-7-2016,91,75,83.0,0,0.0,0
199,18-7-2016,93,72,82.5,0.35,0.0,0
203,22-7-2016,94,74,84.0,0,0.0,0
204,23-7-2016,96,80,88.0,0,0.0,0
205,24-7-2016,94,75,84.5,0,0.0,0
206,25-7-2016,93,73,83.0,1,0.0,0
208,27-7-2016,91,74,82.5,0,0.0,0
209,28-7-2016,95,75,85.0,T,0.0,0
223,11-8-2016,91,74,82.5,0.15,0.0,0


In [24]:
# Still on way2 using 'isin' method

df[df['date'].isin(['10-4-2016', '10-5-2016', '10-6-2016'])]

Unnamed: 0,date,maximum temperature,minimum temperature,average temperature,precipitation,snow fall,snow depth
100,10-4-2016,50,31,40.5,0.0,0.0,0
130,10-5-2016,63,50,56.5,0.0,0.0,0
161,10-6-2016,77,57,67.0,0.0,0.0,0


Here, isin() returns a Boolean Series, just like the comparison used above. It is equivalent to chaining equality checks with the | (or) operator: this date, or this one, or this one.

Remember that Python's or/and keywords do not work on a Series. Use | and & for element-wise or/and, and wrap each condition in parentheses.
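
The sketch below illustrates both points on a tiny made-up frame (the name `weather` and the column `max_temp` are hypothetical, not the real dataset):

```python
import pandas as pd

weather = pd.DataFrame({
    'date': ['10-4-2016', '10-5-2016', '11-5-2016'],
    'max_temp': [50, 63, 95],
})

# Each condition needs its own parentheses because | and & bind
# more tightly than comparison operators such as == and >.
by_or = weather[(weather['date'] == '10-4-2016') | (weather['date'] == '10-5-2016')]
by_isin = weather[weather['date'].isin(['10-4-2016', '10-5-2016'])]
print(by_or.equals(by_isin))

# Python's `and`/`or` ask for a single truth value of a whole Series,
# which pandas refuses to give.
try:
    weather[(weather['max_temp'] > 60) and (weather['max_temp'] < 90)]
except ValueError as err:
    print('ValueError:', err)
```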

### Filtering Specific Rows and Columns from a DataFrame
To subset both rows and columns in one step, selection brackets [] alone are not sufficient. The loc or iloc accessors are required in front of the selection brackets.


#### Syntax:
df.loc[row_label, col_label] ==> label-based access <br>
df.iloc[row_index, col_index] ==> position-based access
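
The difference is easiest to see when the row labels are not integers. A minimal sketch with an invented frame (`scores`) and a string index:

```python
import pandas as pd

scores = pd.DataFrame(
    {'math': [90, 75, 60], 'art': [70, 85, 95]},
    index=['ann', 'ben', 'cy'],
)

# loc looks rows and columns up by label.
by_label = scores.loc['ben', 'art']

# iloc looks them up by integer position (row 1, column 1).
by_position = scores.iloc[1, 1]

print(by_label, by_position)
```

With the default RangeIndex used in the weather data, row labels and row positions happen to coincide, which is why `loc` and `iloc` can return the same row there.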

In [25]:
# Accesses the row whose index label is 100

df.loc[100]

date                   10-4-2016
maximum temperature           50
minimum temperature           31
average temperature         40.5
precipitation               0.00
snow fall                    0.0
snow depth                     0
Name: 100, dtype: object

In [26]:
# Selects the 'date' value in the row whose index label is 100

df.loc[100, 'date']

'10-4-2016'

In [27]:
df.loc[100, ['date', 'snow fall']]

date         10-4-2016
snow fall          0.0
Name: 100, dtype: object

In [28]:
# With the default RangeIndex, iloc[100] returns the same row as df.loc[100]
df.iloc[100]

date                   10-4-2016
maximum temperature           50
minimum temperature           31
average temperature         40.5
precipitation               0.00
snow fall                    0.0
snow depth                     0
Name: 100, dtype: object

In [29]:
df.iloc[100, [0,5]]

date         10-4-2016
snow fall          0.0
Name: 100, dtype: object

In [30]:
# slicing with labels

df.loc[10:15, 'minimum temperature':'precipitation']

Unnamed: 0,minimum temperature,average temperature,precipitation
10,26,33.0,0.00
11,25,34.5,0.00
12,22,26.0,0.00
13,22,30.0,0.00
14,34,42.5,T
15,42,47.0,0.24


In [31]:
# slicing with indexes

df.iloc[10:15, 2:5]

Unnamed: 0,minimum temperature,average temperature,precipitation
10,26,33.0,0.00
11,25,34.5,0.00
12,22,26.0,0.00
13,22,30.0,0.00
14,34,42.5,T
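
Notice that the label slice returned six rows (10 through 15) while the position slice returned five (10 through 14). A short sketch of that endpoint rule on a made-up frame:

```python
import pandas as pd

frame = pd.DataFrame({'value': range(20)})

# Label slice: both endpoints are included, so rows 10..15.
label_rows = frame.loc[10:15]

# Position slice: the stop is excluded, as with Python lists, so rows 10..14.
pos_rows = frame.iloc[10:15]

print(len(label_rows), len(pos_rows))
```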


### Accessing rows based on condition

df.loc[condition, col_labels] <br>

Accessing rows based on multiple conditions <br>
df.loc[(cond1) & (cond2) | (cond3) & (cond4)]

In [33]:
# selecting dates with the temp > 90

filtering = df.loc[df['maximum temperature']>90, 'date']

filtering

148    28-5-2016
187     6-7-2016
199    18-7-2016
203    22-7-2016
204    23-7-2016
205    24-7-2016
206    25-7-2016
208    27-7-2016
209    28-7-2016
223    11-8-2016
224    12-8-2016
225    13-8-2016
226    14-8-2016
227    15-8-2016
241    29-8-2016
252     9-9-2016
257    14-9-2016
Name: date, dtype: object

In [34]:
type(filtering)

pandas.core.series.Series

In [35]:
# converting to a numpy array

filtering.to_numpy()

array(['28-5-2016', '6-7-2016', '18-7-2016', '22-7-2016', '23-7-2016',
       '24-7-2016', '25-7-2016', '27-7-2016', '28-7-2016', '11-8-2016',
       '12-8-2016', '13-8-2016', '14-8-2016', '15-8-2016', '29-8-2016',
       '9-9-2016', '14-9-2016'], dtype=object)

In [36]:
df.loc[df['maximum temperature']>90, ['date', 'snow fall']]

Unnamed: 0,date,snow fall
148,28-5-2016,0.0
187,6-7-2016,0.0
199,18-7-2016,0.0
203,22-7-2016,0.0
204,23-7-2016,0.0
205,24-7-2016,0.0
206,25-7-2016,0.0
208,27-7-2016,0.0
209,28-7-2016,0.0
223,11-8-2016,0.0


In [37]:
# to retrieve the dates where the max temp > 90 and precipitation == 'T'

df.loc[(df['maximum temperature']>90) & (df['precipitation']=='T'), 'date']

209    28-7-2016
Name: date, dtype: object

### Renaming columns, modifying data types, creating new columns and deleting columns in pandas dataframe

### **BEWARE!!!!** ❌❌

Due to the size of the data (about 541,000 rows), it may take some time to import completely.

In [39]:
df = pd.read_csv('retail_store_sales.csv')

In [40]:
df.shape

(541909, 8)

In [41]:
df.columns

Index(['Invoice No', ' Stock-Code ', 'Description', 'Quantity', 'Invoice Date',
       'Unit Price', 'Customer ID', 'Country'],
      dtype='object')

In [42]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   Invoice No    541909 non-null  object 
 1    Stock-Code   541909 non-null  object 
 2   Description   540455 non-null  object 
 3   Quantity      541909 non-null  int64  
 4   Invoice Date  541909 non-null  object 
 5   Unit Price    541909 non-null  float64
 6   Customer ID   406829 non-null  float64
 7   Country       541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [57]:
df.dtypes

Invoice No       object
 Stock-Code      object
Description      object
Quantity          int64
Invoice Date     object
Unit Price      float64
Customer ID     float64
Country          object
dtype: object

In [44]:
print('Total sales record: ', df.shape[0])

print('Total unique customers: ', df['Customer ID'].nunique())

print('Date range: ', df['Invoice Date'].min(), 'to', df['Invoice Date'].max())

Total sales record:  541909
Total unique customers:  4372
Date range:  1/10/2011 10:04 to 9/9/2011 9:52
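
Because `Invoice Date` is stored as text (dtype object), `min()` and `max()` compare the strings alphabetically, so the range printed above is not chronological. A hedged sketch of the fix, using three sample strings and assuming the format is month/day/year hour:minute (an assumption based on how the sample rows look, not something the dataset states):

```python
import pandas as pd

raw = pd.Series(['12/1/2010 8:26', '9/9/2011 9:52', '1/10/2011 10:04'])

# String comparison: '1/10/2011 10:04' sorts first alphabetically.
print(raw.min(), 'to', raw.max())

# Assumed format: month/day/year hour:minute.
parsed = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')
print(parsed.min(), 'to', parsed.max())
```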


In [45]:
#checking for unique countries

df['Country'].unique()

array(['United Kingdom', 'France', 'Australia', 'Netherlands', 'Germany',
       'Norway', 'EIRE', 'Switzerland', 'Spain', 'Poland', 'Portugal',
       'Italy', 'Belgium', 'Lithuania', 'Japan', 'Iceland',
       'Channel Islands', 'Denmark', 'Cyprus', 'Sweden', 'Austria',
       'Israel', 'Finland', 'Bahrain', 'Greece', 'Hong Kong', 'Singapore',
       'Lebanon', 'United Arab Emirates', 'Saudi Arabia',
       'Czech Republic', 'Canada', 'Unspecified', 'Brazil', 'USA',
       'European Community', 'Malta', 'RSA'], dtype=object)

In pandas, both `.unique()` and `.nunique()` work with the distinct values in a Series or DataFrame column, but they return different types of results:

### unique()
- Returns an array of all unique/distinct values in the Series.
- Good when you want to see what the unique values are.
- Output type: `numpy.ndarray`

**Example**:

import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3])

print(s.unique())

**Output**:

[1 2 3]


### nunique()
- Returns a single number: the count of unique values.
- Good when you only care about how many unique values there are.
- Output type: `int`

**Example**:

print(s.nunique())

**Output**:

3
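
The two methods also treat missing values differently, which matters for columns such as `Customer ID` that contain nulls. A minimal sketch with a made-up Series:

```python
import numpy as np
import pandas as pd

s_nan = pd.Series([1, 2, 2, np.nan])

distinct = s_nan.unique()                  # NaN appears among the distinct values
n_default = s_nan.nunique()                # NaN is not counted by default
n_with_nan = s_nan.nunique(dropna=False)   # count NaN as a value too

print(distinct, n_default, n_with_nan)
```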

In [46]:
#total sales record for each country

df['Country'].value_counts()

Country
United Kingdom          495478
Germany                   9495
France                    8557
EIRE                      8196
Spain                     2533
Netherlands               2371
Belgium                   2069
Switzerland               2002
Portugal                  1519
Australia                 1259
Norway                    1086
Italy                      803
Channel Islands            758
Finland                    695
Cyprus                     622
Sweden                     462
Unspecified                446
Austria                    401
Denmark                    389
Japan                      358
Poland                     341
Israel                     297
USA                        291
Hong Kong                  288
Singapore                  229
Iceland                    182
Canada                     151
Greece                     146
Malta                      127
United Arab Emirates        68
European Community          61
RSA                         58


### Renaming columns
The rename() method can be used for both row labels and column labels. Provide a dictionary whose keys are the current names and whose values are the new names.

**Syntax:** <br>
df.rename(index = None, columns=None)
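
A small sketch of both arguments, plus the callable form that `rename()` also accepts. The frame `sales` and the new names are invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({'Unit Price': [2.55, 3.39]}, index=[0, 1])

# Dict form: old name -> new name, for columns and for row labels.
renamed = sales.rename(columns={'Unit Price': 'price'},
                       index={0: 'first', 1: 'second'})

# Function form: the callable is applied to every column label.
upper = sales.rename(columns=str.upper)

print(renamed)
print(list(upper.columns))
```

`rename()` returns a new DataFrame by default, so `sales` itself keeps its original labels.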

In [47]:
df.columns

Index(['Invoice No', ' Stock-Code ', 'Description', 'Quantity', 'Invoice Date',
       'Unit Price', 'Customer ID', 'Country'],
      dtype='object')

In [50]:
columns = {'Description': 'Production Description', 'Customer ID': 'Cust ID'}

df_renamed = df.rename(columns=columns)

df_renamed.columns

Index(['Invoice No', ' Stock-Code ', 'Production Description', 'Quantity',
       'Invoice Date', 'Unit Price', 'Cust ID', 'Country'],
      dtype='object')

In [51]:
df_renamed.head()

Unnamed: 0,Invoice No,Stock-Code,Production Description,Quantity,Invoice Date,Unit Price,Cust ID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


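While renaming columns we can also strip extra spaces, convert the names to lower case, and replace spaces with underscores. A column whose name is a valid Python identifier can then be accessed with dot notation, for example `df.unit_price`. The code below only replaces spaces, so a name such as `stock-code` keeps its hyphen and still needs bracket access.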

In [53]:
col_names = [col.strip().lower().replace(' ', '_') for col in df_renamed.columns]
col_names

['invoice_no',
 'stock-code',
 'production_description',
 'quantity',
 'invoice_date',
 'unit_price',
 'cust_id',
 'country']

In [54]:
df_renamed.columns = col_names
df_renamed.columns

Index(['invoice_no', 'stock-code', 'production_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country'],
      dtype='object')
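
One possible way to handle hyphens and other punctuation as well is a regular expression. The helper `clean_name` below is hypothetical (it is not a pandas function) and the toy frame only mimics two of the dataset's columns.

```python
import re
import pandas as pd

def clean_name(col):
    """Lowercase a column name and collapse runs of non-alphanumeric characters into '_'."""
    return re.sub(r'[^0-9a-z]+', '_', col.strip().lower()).strip('_')

toy = pd.DataFrame({' Stock-Code ': ['85123A'], 'Unit Price': [2.55]})
toy.columns = [clean_name(c) for c in toy.columns]

print(toy.columns.tolist())
print(toy.stock_code[0])  # dot access works once the name is a valid identifier
```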

In [55]:
df_renamed.head()

Unnamed: 0,invoice_no,stock-code,production_description,quantity,invoice_date,unit_price,cust_id,country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


### Modifying column data type

There are several ways to do this:

1. Modifying the data type using **DataFrame.astype()** <br>
Here, we can pass any Python, NumPy or pandas data type to change all columns of a dataframe to that type, or we can pass a dictionary with column names as keys and data types as values to change only the selected columns.

2. Modifying the data type using **DataFrame.apply()** <br>
Here, we can pass `pandas.to_numeric`, `pandas.to_datetime`, or `pandas.to_timedelta` as the argument to `apply()` to change the data type of one or more columns to numeric, datetime or timedelta respectively.
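
A small sketch of both approaches on a toy frame (the values are invented). It assumes we want unparseable values to become `NaN`, which is what `errors='coerce'` does in `pd.to_numeric`.

```python
import pandas as pd

toy = pd.DataFrame({
    'quantity': ['6', '8', 'n/a'],          # strings, one of them not a number
    'unit_price': ['2.55', '3.39', '2.75'],
    'country': ['UK', 'UK', 'FR'],
})

# 1. astype() with a dictionary converts only the listed columns
toy = toy.astype({'unit_price': 'float64'})

# 2. apply(pd.to_numeric) converts column by column; extra keyword
#    arguments such as errors='coerce' are passed on to pd.to_numeric
converted = toy[['quantity']].apply(pd.to_numeric, errors='coerce')
toy['quantity'] = converted['quantity']

print(toy.dtypes)
```

`astype()` would raise on the `'n/a'` value, so `to_numeric` with `errors='coerce'` is the more forgiving choice for messy columns.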



In [56]:
## Converting all columns to string data type

df_renamed = df_renamed.astype(str)
df_renamed.dtypes

invoice_no                object
stock-code                object
production_description    object
quantity                  object
invoice_date              object
unit_price                object
cust_id                   object
country                   object
dtype: object

In [60]:
df_renamed[['quantity', 'unit_price', 'cust_id']] = df_renamed[['quantity', 'unit_price', 'cust_id']].astype('float64')
df_renamed.dtypes

invoice_no                 object
stock-code                 object
production_description     object
quantity                  float64
invoice_date               object
unit_price                float64
cust_id                   float64
country                    object
dtype: object

In [62]:
## using a dictionary to convert specific columns
convert_dict = {'quantity': int, 'country': str}
df_renamed = df_renamed.astype(convert_dict)
df_renamed.dtypes

invoice_no                 object
stock-code                 object
production_description     object
quantity                    int64
invoice_date               object
unit_price                float64
cust_id                   float64
country                    object
dtype: object

In [66]:
## Using apply() on a single column (Series.apply)
df_renamed['invoice_date'] = df_renamed['invoice_date'].apply(pd.to_datetime)

df_renamed.dtypes

#this takes a long while to run: apply() calls pd.to_datetime once per value, and the dataset is large

invoice_no                        object
stock-code                        object
production_description            object
quantity                           int64
invoice_date              datetime64[ns]
unit_price                       float64
cust_id                          float64
country                           object
invoice_dat               datetime64[ns]
amount                           float64
dtype: object
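
Calling `apply(pd.to_datetime)` on a Series parses one value at a time, which is the likely reason the cell above is slow. Passing the whole column to `pd.to_datetime` in a single call is usually much faster. The sketch below uses two made-up strings in the same month-first layout as the dataset.

```python
import pandas as pd

toy = pd.Series(['12/1/2010 8:26', '12/9/2011 12:50'])

slow = toy.apply(pd.to_datetime)                      # one call per element
fast = pd.to_datetime(toy, format='%m/%d/%Y %H:%M')   # one vectorized call

print(fast)
```

Giving an explicit `format` also removes any guessing about which number is the day and which is the month.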

### Creating a derived column

In [68]:
## Creating a derived column called amount
df_renamed['amount'] = df_renamed['quantity'] * df_renamed['unit_price']
df_renamed.head()

Unnamed: 0,invoice_no,stock-code,production_description,quantity,invoice_date,unit_price,cust_id,country,invoice_dat,amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01 08:26:00,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01 08:26:00,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34


### Creating columns using `apply()` function

1. **For a dataframe** <br>
`df.apply(function, axis = 0)` applies the function to each column. Remember that `axis = 0` means column-wise and is the default.

2. **For a Series** <br>
`series.apply(function)` applies the function element-wise. `Series.apply()` has no `axis` parameter.

**Note**:

- axis = 0 -- applies the function to each column <br>
- axis = 1 -- applies the function to each row
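
The three cases can be sketched on a toy frame (values invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({'quantity': [6, 8, 2], 'unit_price': [2.5, 3.0, 10.0]})

# axis=0 (default): the function receives each column as a Series
col_max = toy.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_total = toy.apply(lambda row: row['quantity'] * row['unit_price'], axis=1)

# Series.apply: the function receives each element
doubled = toy['quantity'].apply(lambda x: x * 2)

print(col_max)
print(row_total.tolist())
print(doubled.tolist())
```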

In [69]:
#Using numpy
df_renamed.apply(np.max)

#applies the max function for each column of the dataset

invoice_no                            C581569
stock-code                                  m
production_description      wrongly sold sets
quantity                                80995
invoice_date              2011-12-09 12:50:00
unit_price                            38970.0
cust_id                               18287.0
country                           Unspecified
invoice_dat               2011-12-09 12:50:00
amount                               168469.6
dtype: object

In [70]:
#applying a function to a specific column
df_renamed[['amount']].apply(np.mean)

amount    17.987795
dtype: float64

In [71]:
#the above code can be concisely written as
df_renamed['amount'].mean()

np.float64(17.98779487699964)

In [73]:
#let's recreate the amount column, this time calling it new_amount
#we will do this using the apply function with axis=1 (row by row)
df_renamed['new_amount'] = df_renamed.apply(lambda row: row['quantity'] * row['unit_price'], axis=1)
df_renamed.head()

Unnamed: 0,invoice_no,stock-code,production_description,quantity,invoice_date,unit_price,cust_id,country,invoice_dat,amount,new_amount
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01 08:26:00,15.3,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01 08:26:00,22.0,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,20.34


In [74]:
#creating a new column called 'new_amt_with_taxes'
#we assume a 15% tax on each product
#note: col*0.15 gives the tax portion only; use col*1.15 for the tax-inclusive amount
df_renamed['new_amt_with_taxes'] = df_renamed['new_amount'].apply(lambda col: col*0.15)
df_renamed.head()

Unnamed: 0,invoice_no,stock-code,production_description,quantity,invoice_date,unit_price,cust_id,country,invoice_dat,amount,new_amount,new_amt_with_taxes
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,2010-12-01 08:26:00,15.3,15.3,2.295
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,20.34,3.051
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,2010-12-01 08:26:00,22.0,22.0,3.3
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,20.34,3.051
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,20.34,3.051
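
For plain arithmetic like this, whole-column operations give the same numbers as `apply()` and are usually faster. The sketch below assumes the 15% rate from the cell above. The column names `tax` and `amount_with_tax` are made up here to keep the tax portion and the tax-inclusive total apart.

```python
import pandas as pd

TAX_RATE = 0.15  # 15% tax, as assumed in the cell above

toy = pd.DataFrame({'quantity': [6, 8], 'unit_price': [2.55, 2.75]})

toy['amount'] = toy['quantity'] * toy['unit_price']        # no apply() needed
toy['tax'] = toy['amount'] * TAX_RATE                      # tax portion only
toy['amount_with_tax'] = toy['amount'] * (1 + TAX_RATE)    # tax-inclusive total

print(toy)
```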


### Deleting columns in a dataframe
There are several ways to delete one or more columns from a dataframe.

1. Dropping columns by name <br>
==> dropping two columns

`df.drop(['col1', 'col2'], axis = 1, inplace = True)`

With `inplace = True` the dataframe itself is modified and `drop()` returns `None`. Without it, `drop()` returns a new dataframe and leaves `df` unchanged.

2. Removing a range of columns by name using loc[]. Useful when you want to remove several adjacent columns <br>
==> removing all columns from col3 to col7, both included

`df.drop(df.loc[:, 'col3':'col7'], axis = 1, inplace = True)`

3. Removing columns by position <br>
==> removing the three columns at positions 0, 4 and 2

`df.drop(df.columns[[0,4,2]], axis = 1, inplace = True)`

4. Removing a range of columns by position using iloc[] <br>
==> removing the two columns at positions 1 and 2 (the end of the slice, 3, is excluded)

`df.drop(df.iloc[:, 1:3], axis=1, inplace=True)`

5. DataFrame.pop() method <br>
`pop()` deletes a single column at a time and returns it as a Series. It changes the dataframe directly, so it has no `inplace` parameter.

`df.pop('col4')`
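
The five variants can be tried safely on a toy frame (the column names `col1` to `col5` are placeholders). This sketch leaves out `inplace=True`, so each `drop()` call returns a new frame and `toy` is only changed by `pop()`. It also passes `.columns` explicitly, which is an equivalent and slightly more explicit spelling of the label list.

```python
import pandas as pd

toy = pd.DataFrame({c: [1, 2] for c in ['col1', 'col2', 'col3', 'col4', 'col5']})

by_name = toy.drop(columns=['col1', 'col2'])      # same as drop([...], axis=1)
by_label_range = toy.drop(columns=toy.loc[:, 'col2':'col4'].columns)  # label slices include both ends
by_position = toy.drop(columns=toy.columns[[0, 4, 2]])
by_position_range = toy.drop(columns=toy.columns[1:3])  # positions 1 and 2; the end is excluded

popped = toy.pop('col4')  # removes the column from toy itself and returns it as a Series

print(toy.columns.tolist())
```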

In [75]:
df_renamed.columns

Index(['invoice_no', 'stock-code', 'production_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'invoice_dat',
       'amount', 'new_amount', 'new_amt_with_taxes'],
      dtype='object')

In [77]:
#Syntax 1
df_renamed.drop(['amount'], axis=1)
df_renamed.columns

Index(['invoice_no', 'stock-code', 'production_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'invoice_dat',
       'amount', 'new_amount', 'new_amt_with_taxes'],
      dtype='object')

In [78]:
#the amount column was not removed
#to ensure the changes are saved, use inplace = True

df_renamed.drop(['amount'], axis = 1, inplace=True)
df_renamed.columns

Index(['invoice_no', 'stock-code', 'production_description', 'quantity',
       'invoice_date', 'unit_price', 'cust_id', 'country', 'invoice_dat',
       'new_amount', 'new_amt_with_taxes'],
      dtype='object')

In [79]:
#Syntax 2
df_renamed.drop(df_renamed.loc[:, 'invoice_no': 'invoice_date'], axis=1)

Unnamed: 0,unit_price,cust_id,country,invoice_dat,new_amount,new_amt_with_taxes
0,2.55,17850.0,United Kingdom,2010-12-01 08:26:00,15.30,2.2950
1,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,3.0510
2,2.75,17850.0,United Kingdom,2010-12-01 08:26:00,22.00,3.3000
3,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,3.0510
4,3.39,17850.0,United Kingdom,2010-12-01 08:26:00,20.34,3.0510
...,...,...,...,...,...,...
541904,0.85,12680.0,France,2011-12-09 12:50:00,10.20,1.5300
541905,2.10,12680.0,France,2011-12-09 12:50:00,12.60,1.8900
541906,4.15,12680.0,France,2011-12-09 12:50:00,16.60,2.4900
541907,4.15,12680.0,France,2011-12-09 12:50:00,16.60,2.4900


In [83]:
# Syntax 3
df_renamed.drop(df_renamed.columns[[0,4,2]], axis=1)

Unnamed: 0,cust_id,new_amount
0,17850.0,15.30
1,17850.0,20.34
2,17850.0,22.00
3,17850.0,20.34
4,17850.0,20.34
...,...,...
541904,12680.0,10.20
541905,12680.0,12.60
541906,12680.0,16.60
541907,12680.0,16.60


In [84]:
# Syntax 4
df_renamed.drop(df_renamed.iloc[:, 1:3], axis=1)

Unnamed: 0,quantity,new_amount,new_amt_with_taxes
0,6,15.30,2.2950
1,6,20.34,3.0510
2,8,22.00,3.3000
3,6,20.34,3.0510
4,6,20.34,3.0510
...,...,...,...
541904,12,10.20,1.5300
541905,6,12.60,1.8900
541906,4,16.60,2.4900
541907,4,16.60,2.4900


In [85]:
df_renamed.columns

Index(['quantity', 'cust_id', 'invoice_dat', 'new_amount',
       'new_amt_with_taxes'],
      dtype='object')

In [87]:
# Syntax 5
df_renamed.pop('quantity')

0          6
1          6
2          8
3          6
4          6
          ..
541904    12
541905     6
541906     4
541907     4
541908     3
Name: quantity, Length: 541909, dtype: int64

In [88]:
df_renamed.columns

Index(['cust_id', 'invoice_dat', 'new_amt_with_taxes'], dtype='object')

### Adding/Inserting Rows

In [91]:
df = pd.read_csv('weather_data.csv')

### Inserting row(s) using dictionary - `pandas.concat()`

1. Inserting a single row
- To create a new record using dictionary <br>
`new_record = pd.DataFrame( [{'day': '1/7/2019', 'temperature': 45, 'windspeed': 8, 'event': 'Sunny'}] )`

- To insert a row at the end <br>
`df = pd.concat([df, new_record], ignore_index=True)`

- To insert a row at the top <br>
`df = pd.concat([new_record, df], ignore_index= True)`

2. Inserting multiple rows (that is, a batch of data)
- To create a batch of records using a list of dictionaries <br>
`batch_records = pd.DataFrame([{'day': '1/7/2019', 'temperature': 45, 'windspeed': 8, 'event': 'Sunny'}, {'day': '1/9/2019', 'temperature': 40, 'windspeed': 3, 'event': 'Snow'}])`

- To insert the rows at the end <br>
`df = pd.concat([df, batch_records], ignore_index=True)`

- To insert the rows at the top <br>
`df = pd.concat([batch_records, df], ignore_index=True)`
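
To see why `ignore_index=True` appears in every call, compare the row labels with and without it. The frame below is a cut-down, made-up version of the weather data.

```python
import pandas as pd

toy = pd.DataFrame({'day': ['1/1/2017', '1/2/2017'], 'temperature': [32, 35]})
new_record = pd.DataFrame([{'day': '1/7/2019', 'temperature': 45}])

kept = pd.concat([toy, new_record])                            # original labels are kept
renumbered = pd.concat([toy, new_record], ignore_index=True)   # labels are rebuilt from 0

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Without `ignore_index=True` the new row keeps its own label `0`, so the result has two rows labelled `0`.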

In [92]:
#creating a new record
new_record = pd.DataFrame(
    [{'day': '1/7/2019', 'temperature': 45, 'windspeed': 8, 'event': 'Sunny'}]
)

#inserting at the end
df = pd.concat([df, new_record], ignore_index=True)
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2019,45,8,Sunny


In [93]:
#inserting multiple records
batch_records = pd.DataFrame(
    [
        {'day': '1/8/2019', 'temperature': 35, 'windspeed': 5, 'event': 'Sunny'},
        {'day': '1/9/2019', 'temperature': 40, 'windspeed': 3, 'event': 'Snow'}
    ]
)

#inserting at the end
df = pd.concat([df, batch_records], ignore_index=True)
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2019,45,8,Sunny
7,1/8/2019,35,5,Sunny
8,1/9/2019,40,3,Snow


### Inserting a row using list - `.loc[]` and `.iloc[]`

This works differently from using a dictionary. A plain list carries no column names, so instead of concat() we assign the list through the loc accessor. Using the length of the dataframe as the label creates a new row, provided the index is the default 0, 1, 2, ... range.

1. Using DataFrame.loc[]

`df.loc[len(df)] = ['1/12/2017', 28, 2, 'Rain']`

2. Using DataFrame.iloc[]

This generates an error because iloc can't be used to add new rows
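
A minimal sketch of both behaviours on a toy frame (values invented). It also shows a caveat: when the labels are not 0 to n-1, `len(df)` may already be in use, and the assignment then overwrites a row instead of adding one.

```python
import pandas as pd

toy = pd.DataFrame({'day': ['1/1/2017', '1/2/2017', '1/3/2017'], 'temperature': [32, 35, 28]})

toy.loc[len(toy)] = ['1/4/2017', 24]      # label 3 does not exist yet, so a row is added

iloc_error = None
try:
    toy.iloc[len(toy)] = ['1/5/2017', 32]  # position 4 is out of bounds
except IndexError as exc:
    iloc_error = str(exc)

# Caveat: after dropping label 0 the labels are 1, 2, 3 and len(gappy) == 3
gappy = toy.drop(index=0)
gappy.loc[len(gappy)] = ['overwritten', -1]   # replaces the row labelled 3

print(iloc_error)
print(gappy)
```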

In [94]:
#inserting a row with the .loc[] indexer using a list
df.loc[len(df)] = ['1/12/2017', 28, 2, 'Rain']
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2019,45,8,Sunny
7,1/8/2019,35,5,Sunny
8,1/9/2019,40,3,Snow
9,1/12/2017,28,2,Rain


In [95]:
# this will generate an error because we cannot insert a row using the .iloc[] indexer
df.iloc[len(df)] = ['1/12/2017', 28, 2, 'Rain']
df

IndexError: iloc cannot enlarge its target object

The error occurs because you are trying to assign a new row to a DataFrame using .iloc with an index that does not exist. By design, iloc is for accessing and modifying existing rows and columns, and it cannot be used to expand a DataFrame by adding new rows.

In this case, len(df) is equal to the total number of rows in the DataFrame, which represents an out-of-bounds index because indexing in pandas is zero-based. To add a new row to a pandas DataFrame, use one of the following methods:

- `loc`
- `pd.concat`
- Rebuild the DataFrame after adding a row 👇

`data = df.values.tolist()  # Convert existing DataFrame to list of lists`

`data.append(['1/12/2017', 28, 2, 'Rain'])  # Add new row`

`df = pd.DataFrame(data, columns=df.columns)  # Recreate DataFrame`

In [101]:
data = df.values.tolist()  # Convert existing DataFrame to list of lists
data

[['1/1/2017', 32, 6, 'Rain'],
 ['1/2/2017', 35, 7, 'Sunny'],
 ['1/3/2017', 28, 2, 'Snow'],
 ['1/4/2017', 24, 7, 'Snow'],
 ['1/5/2017', 32, 4, 'Rain'],
 ['1/6/2017', 31, 2, 'Sunny'],
 ['1/7/2019', 45, 8, 'Sunny'],
 ['1/8/2019', 35, 5, 'Sunny'],
 ['1/9/2019', 40, 3, 'Snow'],
 ['1/12/2017', 28, 2, 'Rain'],
 ['1/12/2017', 28, 2, 'Rain']]

In [100]:
data.append(['1/12/2017', 28, 2, 'Rain'])  # Add new row
data

[['1/1/2017', 32, 6, 'Rain'],
 ['1/2/2017', 35, 7, 'Sunny'],
 ['1/3/2017', 28, 2, 'Snow'],
 ['1/4/2017', 24, 7, 'Snow'],
 ['1/5/2017', 32, 4, 'Rain'],
 ['1/6/2017', 31, 2, 'Sunny'],
 ['1/7/2019', 45, 8, 'Sunny'],
 ['1/8/2019', 35, 5, 'Sunny'],
 ['1/9/2019', 40, 3, 'Snow'],
 ['1/12/2017', 28, 2, 'Rain'],
 ['1/12/2017', 28, 2, 'Rain'],
 ['1/12/2017', 28, 2, 'Rain']]

In [102]:
df = pd.DataFrame(data, columns=df.columns)  # Recreate DataFrame
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2019,45,8,Sunny
7,1/8/2019,35,5,Sunny
8,1/9/2019,40,3,Snow
9,1/12/2017,28,2,Rain


### Inserting a row at a specific index of a dataframe

To do this we assign the row to a fractional label that falls between the two neighbouring labels, then sort the index and reset it.
For example, to insert a row between labels 8 and 9, use 8.5 in df.loc[]

**Syntax**

`df.loc[8.5] = ['1/10/2017', 30, 3, 'Rain']`

_then sort the index_

`df = df.sort_index().reset_index(drop=True)`

`df`
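
An alternative that avoids fractional labels is to split the frame at the target position and concatenate the pieces around the new row. This is a sketch, not the method used in this notebook, and `insert_row` is a hypothetical helper name.

```python
import pandas as pd

def insert_row(df, position, values):
    """Return a new DataFrame with `values` inserted as a row before `position`."""
    new_row = pd.DataFrame([values], columns=df.columns)
    return pd.concat([df.iloc[:position], new_row, df.iloc[position:]], ignore_index=True)

toy = pd.DataFrame({'day': ['1/1/2017', '1/3/2017'], 'temperature': [32, 28]})
toy = insert_row(toy, 1, ['1/2/2017', 35])

print(toy)
```

Because it works on positions rather than labels, this version does not depend on the index being a clean 0, 1, 2, ... range.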

In [105]:
# inserting a row at a specific index of a dataframe
df.loc[8.5] = ['1/10/2017', 30, 3, 'Rain']

#sorting index
df = df.sort_index().reset_index(drop=True)
df

Unnamed: 0,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
6,1/7/2019,45,8,Sunny
7,1/8/2019,35,5,Sunny
8,1/9/2019,40,3,Snow
9,1/10/2017,30,3,Rain


## Handling timeseries data

Pandas has great support for time series analysis and data.

- Valid date strings can be converted to datetime objects using the `to_datetime` function or as part of a read function.

- Datetime columns (dtype `datetime64[ns]`, holding `pandas.Timestamp` values) support calculations, logical operations and convenient date-related properties through the `dt` accessor, like year, month, day, day_of_week, is_leap_year, etc.

- Using the `dt` accessor again, we can access datetime methods like day_name(), month_name(), etc

- `pandas.Timedelta` represents a duration. It is the difference between two dates or times. Many properties of timedelta can be accessed using dt like components, days, seconds, etc

- We can access timedelta methods using dt accessor like total_seconds()
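
The datetime points above can be sketched on two made-up dates. Properties are read without parentheses and methods are called with them.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2016-02-29', '2017-11-08']))

years = dates.dt.year                        # property: no parentheses
leap = dates.dt.is_leap_year
names = dates.dt.day_name()                  # method: needs parentheses
later = dates > pd.Timestamp('2017-01-01')   # logical comparison

gap = dates.iloc[1] - dates.iloc[0]          # subtracting two dates gives a pandas.Timedelta

print(years.tolist(), leap.tolist(), names.tolist(), later.tolist())
print(gap)
```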


In [110]:
df = pd.read_csv('online_store_sales.csv')

In [111]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Row ID         9800 non-null   int64  
 1   Order ID       9800 non-null   object 
 2   Order Date     9800 non-null   object 
 3   Ship Date      9800 non-null   object 
 4   Ship Mode      9800 non-null   object 
 5   Customer ID    9800 non-null   object 
 6   Customer Name  9800 non-null   object 
 7   Segment        9800 non-null   object 
 8   Country        9800 non-null   object 
 9   City           9800 non-null   object 
 10  State          9800 non-null   object 
 11  Postal Code    9789 non-null   float64
 12  Region         9800 non-null   object 
 13  Product ID     9800 non-null   object 
 14  Category       9800 non-null   object 
 15  Sub-Category   9800 non-null   object 
 16  Product Name   9800 non-null   object 
 17  Sales          9800 non-null   float64
dtypes: float

In [115]:
df.head()

Unnamed: 0,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales
0,1,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,08/11/2017,11/11/2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,12/06/2017,16/06/2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,11/10/2016,18/10/2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


In [None]:
## We need to convert the date columns to datetime datatype

#df[['Order Date', 'Ship Date']].apply(pd.to_datetime)

Running the code above results in a parsing error or warning, because these dates are written day first (for example 16/06/2017) while pandas reads ambiguous dates month first by default.
Date formats come in various forms, so we need to tell pandas which one to expect.

There are two ways to do this:
1. Add parameter `dayfirst=True`

2. Add parameter `format="%d/%m/%Y"`
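
The ambiguity is easy to see on a single string. In this sketch the value `08/11/2017` is the first Order Date shown by `df.head()` above.

```python
import pandas as pd

raw = '08/11/2017'   # ambiguous: 8 November or 11 August?

month_first = pd.to_datetime(raw)                    # default reading: month first
day_first = pd.to_datetime(raw, dayfirst=True)       # day first
explicit = pd.to_datetime(raw, format='%d/%m/%Y')    # exact format, nothing to guess

print(month_first.date(), day_first.date(), explicit.date())
```

`dayfirst=True` is a hint for parsing, while `format` is strict, so `format` is the safer option when every value follows the same layout.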

In [113]:
pd.to_datetime(df['Ship Date'], dayfirst = True)

0      2017-11-11
1      2017-11-11
2      2017-06-16
3      2016-10-18
4      2016-10-18
          ...    
9795   2017-05-28
9796   2016-01-17
9797   2016-01-17
9798   2016-01-17
9799   2016-01-17
Name: Ship Date, Length: 9800, dtype: datetime64[ns]

In [117]:
pd.to_datetime(df['Ship Date'], format = "%d/%m/%Y")

0      2017-11-11
1      2017-11-11
2      2017-06-16
3      2016-10-18
4      2016-10-18
          ...    
9795   2017-05-28
9796   2016-01-17
9797   2016-01-17
9798   2016-01-17
9799   2016-01-17
Name: Ship Date, Length: 9800, dtype: datetime64[ns]

In [119]:
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format = "%d/%m/%Y")
df['Order Date'] = pd.to_datetime(df['Order Date'], format = "%d/%m/%Y")

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         9800 non-null   int64         
 1   Order ID       9800 non-null   object        
 2   Order Date     9800 non-null   datetime64[ns]
 3   Ship Date      9800 non-null   datetime64[ns]
 4   Ship Mode      9800 non-null   object        
 5   Customer ID    9800 non-null   object        
 6   Customer Name  9800 non-null   object        
 7   Segment        9800 non-null   object        
 8   Country        9800 non-null   object        
 9   City           9800 non-null   object        
 10  State          9800 non-null   object        
 11  Postal Code    9789 non-null   float64       
 12  Region         9800 non-null   object        
 13  Product ID     9800 non-null   object        
 14  Category       9800 non-null   object        
 15  Sub-Category   9800 n

We can also convert the date columns while reading the data into a dataframe. `read_csv` takes a `parse_dates` argument, while `read_json` uses `convert_dates` instead.

`pd.read_csv(data, parse_dates=['cols'])`

`pd.read_json(data, convert_dates=['cols'])`
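
A self-contained sketch of parsing dates at read time. It uses an in-memory string in place of a file, and the two rows are made up to resemble the sales data.

```python
import io
import pandas as pd

# In-memory stand-in for a CSV file; the rows are invented for this example
csv_text = (
    "Order Date,Ship Date,Sales\n"
    "08/11/2017,11/11/2017,261.96\n"
    "12/06/2017,16/06/2017,14.62\n"
)

parsed = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Order Date', 'Ship Date'],
    dayfirst=True,   # the dates are written day/month/year
)

print(parsed.dtypes)
```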

In [122]:
df = pd.read_csv('online_store_sales.csv', parse_dates=['Order Date', 'Ship Date'], dayfirst=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   Row ID         9800 non-null   int64         
 1   Order ID       9800 non-null   object        
 2   Order Date     9800 non-null   datetime64[ns]
 3   Ship Date      9800 non-null   datetime64[ns]
 4   Ship Mode      9800 non-null   object        
 5   Customer ID    9800 non-null   object        
 6   Customer Name  9800 non-null   object        
 7   Segment        9800 non-null   object        
 8   Country        9800 non-null   object        
 9   City           9800 non-null   object        
 10  State          9800 non-null   object        
 11  Postal Code    9789 non-null   float64       
 12  Region         9800 non-null   object        
 13  Product ID     9800 non-null   object        
 14  Category       9800 non-null   object        
 15  Sub-Category   9800 n

In [124]:
#renaming the columns
col_names = [col.strip().lower().replace(' ', '_').replace('-', '_') for col in df.columns]

df.columns = col_names
df.columns

Index(['row_id', 'order_id', 'order_date', 'ship_date', 'ship_mode',
       'customer_id', 'customer_name', 'segment', 'country', 'city', 'state',
       'postal_code', 'region', 'product_id', 'category', 'sub_category',
       'product_name', 'sales'],
      dtype='object')

In [125]:
print('Orders started from', df['order_date'].min(), 'till', df['order_date'].max())

Orders started from 2015-01-03 00:00:00 till 2018-12-30 00:00:00


### Working with datetime in pandas

Get year, month and day

`df['year'] = df['DoB'].dt.year`

`df['month'] = df['DoB'].dt.month`

`df['day'] = df['DoB'].dt.day`

Get the week of the year, the day of the week and leap year

`df['week_of_year'] = df['DoB'].dt.isocalendar().week`

`df['day_of_week'] = df['DoB'].dt.dayofweek`

`df['is_leap_year'] = df['DoB'].dt.is_leap_year`

Map the day-of-week number to its name (`dt.day_name()` gives the same result directly)

`dw_mapping = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}`

`df['day_of_week_name'] = df['DoB'].dt.weekday.map(dw_mapping)`


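Get the approximate age from the date of birth (a plain year difference, so it is one year too high until this year's birthday has passed)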

`today = pd.to_datetime('today')`

`df['age'] = today.year - df['DoB'].dt.year`
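
A sketch of an age calculation that takes the birthday into account. The `DoB` values are invented, and the reference date is fixed (an assumption made only so the result is reproducible); in practice `pd.to_datetime('today')` would be used. It also uses `dt.isocalendar().week`, since `dt.week` was removed in pandas 2.0.

```python
import pandas as pd

people = pd.DataFrame({'DoB': pd.to_datetime(['1990-06-15', '2000-12-31'])})
today = pd.Timestamp('2024-06-14')   # fixed reference date for a reproducible result

# ISO week number of each date of birth
people['week_of_year'] = people['DoB'].dt.isocalendar().week

# True when this year's birthday has already happened
month = people['DoB'].dt.month
day = people['DoB'].dt.day
had_birthday = (month < today.month) | ((month == today.month) & (day <= today.day))

# Year difference, minus 1 when the birthday is still to come
people['age'] = today.year - people['DoB'].dt.year - (~had_birthday).astype(int)

print(people)
```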


In [126]:
df['order_date'].dt.day_name()

0       Wednesday
1       Wednesday
2          Monday
3         Tuesday
4         Tuesday
          ...    
9795       Sunday
9796      Tuesday
9797      Tuesday
9798      Tuesday
9799      Tuesday
Name: order_date, Length: 9800, dtype: object

In [127]:
df['order_date'].dt.month_name()

0       November
1       November
2           June
3        October
4        October
          ...   
9795         May
9796     January
9797     January
9798     January
9799     January
Name: order_date, Length: 9800, dtype: object

In [128]:
#Calculating delivery time from the order date and ship date
df['delivery_time'] = df['ship_date'] - df['order_date']
df.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,3 days
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3 days
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,4 days
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,7 days
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,7 days


In [129]:
df.tail(10)

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time
9790,9791,CA-2018-144491,2018-03-27,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10001714,Furniture,Chairs,"Global Leather & Oak Executive Chair, Burgundy",211.246,5 days
9791,9792,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,OFF-EN-10003134,Office Supplies,Envelopes,Staple envelope,56.064,2 days
9792,9793,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10003396,Furniture,Chairs,Global Deluxe Steno Chair,107.772,2 days
9793,9794,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,OFF-PA-10001560,Office Supplies,Paper,"Adams Telephone Message Books, 5 1/4” x 11”",4.832,2 days
9794,9795,CA-2015-127166,2015-05-21,2015-05-23,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,77070.0,Central,OFF-BI-10000977,Office Supplies,Binders,Ibico Plastic Spiral Binding Combs,18.24,2 days
9795,9796,CA-2017-125920,2017-05-21,2017-05-28,Standard Class,SH-19975,Sally Hughsby,Corporate,United States,Chicago,Illinois,60610.0,Central,OFF-BI-10003429,Office Supplies,Binders,"Cardinal HOLDit! Binder Insert Strips,Extra St...",3.798,7 days
9796,9797,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,OFF-AR-10001374,Office Supplies,Art,"BIC Brite Liner Highlighters, Chisel Tip",10.368,5 days
9797,9798,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.188,5 days
9798,9799,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.376,5 days
9799,9800,CA-2016-128608,2016-01-12,2016-01-17,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,43615.0,East,TEC-AC-10000487,Technology,Accessories,SanDisk Cruzer 4 GB USB Flash Drive,10.384,5 days


In [130]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 19 columns):
 #   Column         Non-Null Count  Dtype          
---  ------         --------------  -----          
 0   row_id         9800 non-null   int64          
 1   order_id       9800 non-null   object         
 2   order_date     9800 non-null   datetime64[ns] 
 3   ship_date      9800 non-null   datetime64[ns] 
 4   ship_mode      9800 non-null   object         
 5   customer_id    9800 non-null   object         
 6   customer_name  9800 non-null   object         
 7   segment        9800 non-null   object         
 8   country        9800 non-null   object         
 9   city           9800 non-null   object         
 10  state          9800 non-null   object         
 11  postal_code    9789 non-null   float64        
 12  region         9800 non-null   object         
 13  product_id     9800 non-null   object         
 14  category       9800 non-null   object         
 15  sub_

Note that `timedelta64[ns]` is a duration, that is, the difference between two dates or times. Its parts can be read through the `dt` accessor, for example `dt.days` or `dt.components`.
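
The difference between `dt.days`, `dt.total_seconds()` and `dt.components` shows up once a duration is not a whole number of days. The timestamps below are made up for illustration.

```python
import pandas as pd

order = pd.Series(pd.to_datetime(['2017-11-08 09:00', '2016-10-11 00:00']))
ship = pd.Series(pd.to_datetime(['2017-11-11 15:30', '2016-10-18 00:00']))

delivery = ship - order                      # a timedelta Series

whole_days = delivery.dt.days                # property: whole days only
hours = delivery.dt.total_seconds() / 3600   # method: the full duration, here in hours
parts = delivery.dt.components               # DataFrame with days, hours, minutes, ...

print(whole_days.tolist())
print(hours.tolist())
print(parts[['days', 'hours', 'minutes']])
```

`dt.days` drops the leftover 6 hours 30 minutes of the first delivery, while `total_seconds()` keeps it.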

In [131]:
df['delivery_time_days'] = df['delivery_time'].dt.days

df.head(3)

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,3 days,3
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3 days,3
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,4 days,4


In [132]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype          
---  ------              --------------  -----          
 0   row_id              9800 non-null   int64          
 1   order_id            9800 non-null   object         
 2   order_date          9800 non-null   datetime64[ns] 
 3   ship_date           9800 non-null   datetime64[ns] 
 4   ship_mode           9800 non-null   object         
 5   customer_id         9800 non-null   object         
 6   customer_name       9800 non-null   object         
 7   segment             9800 non-null   object         
 8   country             9800 non-null   object         
 9   city                9800 non-null   object         
 10  state               9800 non-null   object         
 11  postal_code         9789 non-null   float64        
 12  region              9800 non-null   object         
 13  product_id          9800 non-null

In [133]:
df['delivery_time'].dt.components

Unnamed: 0,days,hours,minutes,seconds,milliseconds,microseconds,nanoseconds
0,3,0,0,0,0,0,0
1,3,0,0,0,0,0,0
2,4,0,0,0,0,0,0
3,7,0,0,0,0,0,0
4,7,0,0,0,0,0,0
...,...,...,...,...,...,...,...
9795,7,0,0,0,0,0,0
9796,5,0,0,0,0,0,0
9797,5,0,0,0,0,0,0
9798,5,0,0,0,0,0,0


In [134]:
df['delivery_time'].dt.total_seconds()

0       259200.0
1       259200.0
2       345600.0
3       604800.0
4       604800.0
          ...   
9795    604800.0
9796    432000.0
9797    432000.0
9798    432000.0
9799    432000.0
Name: delivery_time, Length: 9800, dtype: float64

### Improving performance by setting the date column as the index

`df = df.set_index(['date'])`

#### Modifying the index in place

`df.set_index(['date'], inplace = True)`
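The two ways of setting the index can be compared on a made-up frame (the `toy` name and its values are assumptions for illustration). One detail worth noting: with `inplace=True` the call returns `None`, so its result should not be assigned back to a variable.

```python
import pandas as pd

# Hypothetical toy data for illustration only
toy = pd.DataFrame({
    "date": pd.to_datetime(["2018-01-01", "2018-01-02"]),
    "sales": [10.0, 20.0],
})

# Option 1: set_index returns a new DataFrame; the original is unchanged
reassigned = toy.set_index("date")

# Option 2: modify a frame in place; the call itself returns None
in_place = toy.copy()
result = in_place.set_index("date", inplace=True)

print(result)               # None
print(in_place.index.name)  # date
```

Reassigning the returned frame tends to be the easier pattern to read and chain, which is a design preference rather than a requirement.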

#### Select data for a specific year and perform aggregation

**select data with a specific year**

`df.loc['2018']`


**select data with a specific day**

`df.loc['2018-5-1']`

**select data using slicing operation**

`df.loc['2018-2-1': '2018-5-1']`

**applying aggregation within a date slicing**

`df.loc['2018-2-1': '2018-5-1', ['sales']].mean()`
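The four patterns above can be tried on a small, already sorted frame. This is a sketch on made-up data: the `toy` frame, its dates, and its sales values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Hypothetical toy data with a sorted DatetimeIndex (one date repeated)
dates = pd.to_datetime([
    "2018-01-15", "2018-02-01", "2018-03-10",
    "2018-05-01", "2018-05-01", "2019-01-05",
])
toy = pd.DataFrame({"sales": [10.0, 20.0, 30.0, 40.0, 60.0, 50.0]}, index=dates)
toy.index.name = "date"

# A year string selects every row in that year
rows_2018 = toy.loc["2018"]

# A full date string selects every row on that day
rows_day = toy.loc["2018-05-01"]

# A string slice includes both endpoints
rows_slice = toy.loc["2018-02-01":"2018-05-01"]

# Aggregation on one column within the slice
mean_sales = toy.loc["2018-02-01":"2018-05-01", ["sales"]].mean()

print(len(rows_2018), len(rows_day), len(rows_slice))
print(mean_sales)
```

Unlike positional slicing, label-based date slicing keeps the end date, so both rows dated 2018-05-01 are part of `rows_slice`.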

In [135]:
df = df.set_index(['order_date'])
df.head()

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2017-11-08,1,CA-2017-152156,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96,3 days,3
2017-11-08,2,CA-2017-152156,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94,3 days,3
2017-06-12,3,CA-2017-138688,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62,4 days,4
2016-10-11,4,US-2016-108966,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775,7 days,7
2016-10-11,5,US-2016-108966,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368,7 days,7


In [136]:
#filtering rows based on year
df.loc['2018']

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2018-04-15,13,CA-2018-114412,2018-04-20,Standard Class,AA-10480,Andrew Allen,Consumer,United States,Concord,North Carolina,28027.0,South,OFF-PA-10002365,Office Supplies,Paper,Xerox 1967,15.552,5 days,5
2018-07-16,24,US-2018-156909,2018-07-18,Second Class,SF-20065,Sandra Flanagan,Consumer,United States,Philadelphia,Pennsylvania,19140.0,East,FUR-CH-10002774,Furniture,Chairs,"Global Deluxe Stacking Chair, Gray",71.372,2 days,2
2018-10-19,35,CA-2018-107727,2018-10-23,Second Class,MA-17560,Matt Abelman,Home Office,United States,Houston,Texas,77095.0,Central,OFF-PA-10000249,Office Supplies,Paper,Easy-staple paper,29.472,4 days,4
2018-09-10,42,CA-2018-120999,2018-09-15,Standard Class,LC-16930,Linda Cazamias,Corporate,United States,Naperville,Illinois,60540.0,Central,TEC-PH-10004093,Technology,Phones,Panasonic Kx-TS550,147.168,5 days,5
2018-09-19,44,CA-2018-139619,2018-09-23,Standard Class,ES-14080,Erin Smith,Corporate,United States,Melbourne,Florida,32935.0,South,OFF-ST-10003282,Office Supplies,Storage,"Advantus 10-Drawer Portable Organizer, Chrome ...",95.616,4 days,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-12-07,9769,CA-2018-142328,2018-12-14,Standard Class,TC-21535,Tracy Collins,Home Office,United States,San Francisco,California,94122.0,West,OFF-PA-10000380,Office Supplies,Paper,"REDIFORM Incoming/Outgoing Call Register, 11"" ...",50.040,7 days,7
2018-03-27,9788,CA-2018-144491,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-BO-10001811,Furniture,Bookcases,"Atlantic Metals Mobile 5-Shelf Bookcases, Cust...",1023.332,5 days,5
2018-03-27,9789,CA-2018-144491,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10004063,Furniture,Chairs,Global Deluxe High-Back Manager's Chair,600.558,5 days,5
2018-03-27,9790,CA-2018-144491,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,TEC-AC-10004901,Technology,Accessories,Kensington SlimBlade Notebook Wireless Mouse w...,39.992,5 days,5


In [137]:
#filter for a specific date
df.loc['2018-03-27']

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2018-03-27,4676,CA-2018-139416,2018-03-29,Second Class,AG-10270,Alejandro Grove,Consumer,United States,Philadelphia,Pennsylvania,19120.0,East,FUR-FU-10003832,Furniture,Furnishings,Eldon Expressions Punched Metal & Wood Desk Ac...,15.008,2 days,2
2018-03-27,8486,CA-2018-124716,2018-03-31,Standard Class,BD-11560,Brendan Dodson,Home Office,United States,Fresno,California,93727.0,West,OFF-PA-10000740,Office Supplies,Paper,Xerox 1982,45.68,4 days,4
2018-03-27,8487,CA-2018-124716,2018-03-31,Standard Class,BD-11560,Brendan Dodson,Home Office,United States,Fresno,California,93727.0,West,OFF-PA-10001144,Office Supplies,Paper,Xerox 1913,110.96,4 days,4
2018-03-27,8488,CA-2018-124716,2018-03-31,Standard Class,BD-11560,Brendan Dodson,Home Office,United States,Fresno,California,93727.0,West,OFF-PA-10000859,Office Supplies,Paper,Unpadded Memo Slips,11.94,4 days,4
2018-03-27,9754,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,OFF-LA-10000476,Office Supplies,Labels,Avery 05222 Permanent Self-Adhesive File Folde...,8.26,2 days,2
2018-03-27,9755,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,OFF-BI-10001679,Office Supplies,Binders,GBC Instant Index System for Binding Systems,17.76,2 days,2
2018-03-27,9756,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,OFF-ST-10001128,Office Supplies,Storage,"Carina Mini System Audio Rack, Model AR050B",332.94,2 days,2
2018-03-27,9757,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,FUR-TA-10002533,Furniture,Tables,BPI Conference Tables,292.1,2 days,2
2018-03-27,9758,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,TEC-PH-10004006,Technology,Phones,Panasonic KX - TS880B Telephone,206.1,2 days,2
2018-03-27,9759,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,OFF-PA-10002615,Office Supplies,Paper,"Ampad Gold Fibre Wirebound Steno Books, 6"" x 9...",17.64,2 days,2


In [138]:
#filter rows based on a date slice
df.loc['2018-03-27': '2018-04-27']

KeyError: 'Value based partial slicing on non-monotonic DatetimeIndexes with non-existing keys is not allowed.'

This error means the DataFrame's DatetimeIndex is not sorted chronologically (it is non-monotonic), so pandas cannot resolve a date-string slice on it. Let's fix that by sorting the index.
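One way to confirm this diagnosis before slicing is to check `is_monotonic_increasing` on the index. The sketch below uses a made-up, deliberately unsorted frame (the `toy` name, dates, and values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data whose dates are deliberately out of order
dates = pd.to_datetime(["2018-04-15", "2018-03-27", "2018-05-02", "2018-04-01"])
toy = pd.DataFrame({"sales": [15.5, 40.0, 7.25, 100.0]}, index=dates)

# False here, because the dates are not in chronological order
was_sorted = toy.index.is_monotonic_increasing

# sort_index returns a new frame ordered by the index
toy_sorted = toy.sort_index()

# On a sorted index, the slice endpoints need not be existing keys
window = toy_sorted.loc["2018-03-28":"2018-04-27"]

print(was_sorted)
print(window)
```

Once the index is sorted, a date slice works even when neither endpoint appears in the index.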

In [139]:
df.sort_index(inplace=True)

In [140]:
#filter rows based on a date slice
df.loc['2018-03-27': '2018-04-27']

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2018-03-27,9790,CA-2018-144491,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,TEC-AC-10004901,Technology,Accessories,Kensington SlimBlade Notebook Wireless Mouse w...,39.9920,5 days,5
2018-03-27,8488,CA-2018-124716,2018-03-31,Standard Class,BD-11560,Brendan Dodson,Home Office,United States,Fresno,California,93727.0,West,OFF-PA-10000859,Office Supplies,Paper,Unpadded Memo Slips,11.9400,4 days,4
2018-03-27,9788,CA-2018-144491,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-BO-10001811,Furniture,Bookcases,"Atlantic Metals Mobile 5-Shelf Bookcases, Cust...",1023.3320,5 days,5
2018-03-27,9754,CA-2018-113705,2018-03-29,Second Class,LC-16870,Lena Cacioppo,Consumer,United States,Richmond,Virginia,23223.0,South,OFF-LA-10000476,Office Supplies,Labels,Avery 05222 Permanent Self-Adhesive File Folde...,8.2600,2 days,2
2018-03-27,9789,CA-2018-144491,2018-04-01,Standard Class,CJ-12010,Caroline Jumper,Consumer,United States,Houston,Texas,77070.0,Central,FUR-CH-10004063,Furniture,Chairs,Global Deluxe High-Back Manager's Chair,600.5580,5 days,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-04-27,4638,CA-2018-168228,2018-04-29,First Class,AP-10915,Arthur Prichep,Consumer,United States,Los Angeles,California,90045.0,West,OFF-AR-10000390,Office Supplies,Art,Newell Chalk Holder,12.3900,2 days,2
2018-04-27,8638,CA-2018-151281,2018-05-02,Standard Class,HM-14980,Henry MacAllister,Consumer,United States,Seattle,Washington,98105.0,West,FUR-FU-10000397,Furniture,Furnishings,Luxo Economy Swing Arm Lamp,139.5800,5 days,5
2018-04-27,4639,CA-2018-168228,2018-04-29,First Class,AP-10915,Arthur Prichep,Consumer,United States,Los Angeles,California,90045.0,West,OFF-AR-10001725,Office Supplies,Art,Boston Home & Office Model 2000 Electric Penci...,47.3000,2 days,2
2018-04-27,3098,CA-2018-135692,2018-05-01,Standard Class,CV-12805,Cynthia Voltz,Corporate,United States,Fort Worth,Texas,76106.0,Central,OFF-LA-10001158,Office Supplies,Labels,"Avery Address/Shipping Labels for Typewriters,...",33.1200,4 days,4


In [141]:
#selecting the sales column for a date slice
df.loc['2018-03-27': '2018-04-27', ['sales']]

order_date,sales
2018-03-27,39.9920
2018-03-27,11.9400
2018-03-27,1023.3320
2018-03-27,8.2600
2018-03-27,600.5580
...,...
2018-04-27,12.3900
2018-04-27,139.5800
2018-04-27,47.3000
2018-04-27,33.1200


In [142]:
#applying aggregation within a date slicing
print('Mean_sales_amt:', df.loc['2018-03-27': '2018-04-27', ['sales']].mean())
print('Spread_sales_amt:', df.loc['2018-03-27': '2018-04-27', ['sales']].std())

Mean_sales_amt: sales    194.771853
dtype: float64
Spread_sales_amt: sales    502.831668
dtype: float64


### Sorting data by index vs. by values, and resetting the index

`df.sort_index(ascending = False)`

`df.sort_values(by='sales')`

`df.reset_index()`
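The difference between the three calls is easiest to see on a tiny frame. The snippet below is a sketch on made-up data (the `toy` frame and its values are assumptions for illustration). Each call returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical toy data with a named DatetimeIndex
dates = pd.to_datetime(["2018-06-19", "2015-01-03", "2017-06-21"])
toy = pd.DataFrame({"sales": [0.444, 16.448, 0.836]}, index=dates)
toy.index.name = "order_date"

# Order rows by the index labels, newest date first
by_index_desc = toy.sort_index(ascending=False)

# Order rows by a column's values, lowest sales first
by_sales = toy.sort_values(by="sales")

# Move the index back into a regular column and restore a 0..n-1 index
flat = toy.reset_index()

print(by_index_desc)
print(by_sales)
print(flat)
```

After `reset_index()`, `order_date` is an ordinary column again, so date-string selection through `.loc` no longer applies until the index is set back.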

In [143]:
df.sort_index(ascending = True).head()

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2015-01-03,7981,CA-2015-103800,2015-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,Texas,77095.0,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,4 days,4
2015-01-04,742,CA-2015-112326,2015-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540.0,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,4 days,4
2015-01-04,741,CA-2015-112326,2015-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540.0,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,4 days,4
2015-01-04,740,CA-2015-112326,2015-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540.0,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,4 days,4
2015-01-05,1760,CA-2015-141817,2015-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,Pennsylvania,19143.0,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,7 days,7


In [144]:
df.sort_values(by = 'sales').head()

#this sorts the dataframe from the lowest sales to the highest

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2018-06-19,4102,US-2018-102288,2018-06-23,Standard Class,ZC-21910,Zuschuss Carroll,Consumer,United States,Houston,Texas,77095.0,Central,OFF-AP-10002906,Office Supplies,Appliances,Hoover Replacement Belt for Commercial Guardsm...,0.444,4 days,4
2018-03-02,9293,CA-2018-124114,2018-03-02,Same Day,RS-19765,Roland Schwarz,Corporate,United States,Waco,Texas,76706.0,Central,OFF-BI-10004022,Office Supplies,Binders,Acco Suede Grain Vinyl Round Ring Binder,0.556,0 days,0
2017-06-21,8659,CA-2017-168361,2017-06-25,Standard Class,KB-16600,Ken Brennan,Corporate,United States,Chicago,Illinois,60623.0,Central,OFF-BI-10003727,Office Supplies,Binders,Avery Durable Slant Ring Binders With Label Ho...,0.836,4 days,4
2015-03-31,4712,CA-2015-112403,2015-03-31,Same Day,JO-15280,Jas O'Carroll,Consumer,United States,Philadelphia,Pennsylvania,19120.0,East,OFF-BI-10003529,Office Supplies,Binders,Avery Round Ring Poly Binders,0.852,0 days,0
2015-09-26,2107,US-2015-152723,2015-09-26,Same Day,HG-14965,Henry Goldwyn,Corporate,United States,Mesquite,Texas,75150.0,Central,OFF-BI-10003460,Office Supplies,Binders,Acco 3-Hole Punch,0.876,0 days,0


In [145]:
df.tail(20)

order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
2018-12-28,271,CA-2018-163979,2019-01-02,Second Class,KH-16690,Kristen Hastings,Corporate,United States,San Francisco,California,94110.0,West,OFF-ST-10003208,Office Supplies,Storage,Adjustable Depth Letter/Legal Cart,725.84,5 days,5
2018-12-29,7635,US-2018-158526,2019-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,Kentucky,40214.0,South,OFF-AR-10003696,Office Supplies,Art,Panasonic KP-350BK Electric Pencil Sharpener w...,34.58,3 days,3
2018-12-29,1879,CA-2018-118885,2019-01-02,Standard Class,JG-15160,James Galang,Consumer,United States,Los Angeles,California,90049.0,West,TEC-PH-10002563,Technology,Phones,Adtran 1202752G1,302.376,4 days,4
2018-12-29,7636,US-2018-158526,2019-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,Kentucky,40214.0,South,FUR-CH-10004495,Furniture,Chairs,"Global Leather and Oak Executive Chair, Black",300.98,3 days,3
2018-12-29,7634,US-2018-158526,2019-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,Kentucky,40214.0,South,OFF-BI-10002414,Office Supplies,Binders,GBC ProClick Spines for 32-Hole Punch,12.53,3 days,3
2018-12-29,7637,US-2018-158526,2019-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,Kentucky,40214.0,South,FUR-CH-10001270,Furniture,Chairs,Harbour Creations Steel Folding Chair,258.75,3 days,3
2018-12-29,1878,CA-2018-118885,2019-01-02,Standard Class,JG-15160,James Galang,Consumer,United States,Los Angeles,California,90049.0,West,FUR-CH-10002880,Furniture,Chairs,"Global High-Back Leather Tilter, Burgundy",393.568,4 days,4
2018-12-29,7633,US-2018-158526,2019-01-01,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Louisville,Kentucky,40214.0,South,FUR-CH-10002602,Furniture,Chairs,DMI Arturo Collection Mission-style Design Woo...,1207.84,3 days,3
2018-12-29,5457,CA-2018-130631,2019-01-02,Standard Class,BS-11755,Bruce Stewart,Consumer,United States,Edmonds,Washington,98026.0,West,OFF-FA-10000089,Office Supplies,Fasteners,Acco Glide Clips,19.6,4 days,4
2018-12-29,2875,US-2018-102638,2018-12-31,First Class,MC-17845,Michael Chen,Consumer,United States,New York City,New York,10035.0,East,OFF-FA-10002988,Office Supplies,Fasteners,Ideal Clamps,6.03,2 days,2


In [146]:
df.reset_index().head()

Unnamed: 0,order_date,row_id,order_id,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales,delivery_time,delivery_time_days
0,2015-01-03,7981,CA-2015-103800,2015-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,Texas,77095.0,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,4 days,4
1,2015-01-04,742,CA-2015-112326,2015-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540.0,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,4 days,4
2,2015-01-04,741,CA-2015-112326,2015-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540.0,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,4 days,4
3,2015-01-04,740,CA-2015-112326,2015-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,Illinois,60540.0,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,4 days,4
4,2015-01-05,1760,CA-2015-141817,2015-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,Pennsylvania,19143.0,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,7 days,7
