# Working with personal data requests


In [2]:
import pandas as pd
import numpy as np

### Your task

Import your own personal data file (or better yet, several of them!). Extract just the timestamp data. Export this timestamp data as a CSV file and save it on your machine.


### What if you have _other_ types of data?

The pandas user guide has a [summary of its input/output (I/O) tools](https://pandas.pydata.org/docs/user_guide/io.html). Here are some formats pandas can import directly:
* CSVs
* JSON
* delimited or fixed-width text (txt)
* HTML
* MS Excel
* Pickle (Python file format)
* Parquet
* And many others...
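
As a rough sketch of how two of these readers are called, the snippet below loads the same records once from CSV text and once from JSON text, then pulls out the timestamp column. The sample rows and the `time` and `title` column names are made up for illustration; a real export will have its own structure.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for a real export file.
csv_text = (
    "time,title\n"
    "2023-04-19T22:04:48.331Z,A\n"
    "2023-04-19T21:59:00.004Z,B\n"
)
json_text = (
    '[{"time": "2023-04-19T22:04:48.331Z", "title": "A"},'
    ' {"time": "2023-04-19T21:59:00.004Z", "title": "B"}]'
)

# Both readers accept a file path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))
# convert_dates=False keeps the timestamps as plain strings for now.
df_json = pd.read_json(io.StringIO(json_text), convert_dates=False)

# Extract just the timestamp column from either frame.
timestamps = df_json["time"]
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file's path.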

pandas only reads `<table>` elements from HTML, so for other HTML content you might want to parse the page using Beautiful Soup.

For PDFs, you might need a tool like [Tabula](https://pypi.org/project/tabula-py/) (a Java tool that runs inside a Python wrapper) to extract tables from the PDF into a table format. _(Note: To use it, I had to install/update Java and then install the Tabula package. A bit annoying, but not too bad.)_ Once I did that, it was fairly easy to run.

If you are having issues getting Tabula installed on your machine:
* make sure you try following the installation instructions [here](https://pypi.org/project/tabula-py/)
* you can also try running it in a Google Colab notebook, example [here](https://colab.research.google.com/github/chezou/tabula-py/blob/master/examples/tabula_example.ipynb)



In [3]:
# Load the watch-history export; the later cells rely on df being defined
df = pd.read_csv("watch-history2.csv")


In [152]:
#df.head(10)

In [5]:
# Bracket indexing is safer than attribute access for column names
df_time = df["time"]

In [6]:
df_time


0        2023-04-19T22:04:48.331Z
1        2023-04-19T21:59:00.004Z
2        2023-04-19T21:58:37.669Z
3        2023-04-19T21:58:25.254Z
4        2023-04-19T03:35:12.207Z
                   ...           
19966    2019-04-10T12:00:53.076Z
19967    2019-04-09T21:34:21.036Z
19968    2019-04-07T21:27:52.689Z
19969    2019-04-07T21:18:22.625Z
19970    2019-04-07T12:30:57.931Z
Name: time, Length: 19971, dtype: object
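
The `dtype: object` in the output above means the timestamps are still plain strings. A minimal sketch of converting them to real datetimes, using two values copied from that output (the variable names `raw` and `parsed` are just for illustration):

```python
import pandas as pd

# Two sample values copied from the output above.
raw = pd.Series(
    ["2023-04-19T22:04:48.331Z", "2019-04-07T12:30:57.931Z"], name="time"
)

# utc=True keeps the trailing "Z" (UTC) as timezone information.
parsed = pd.to_datetime(raw, utc=True)

# With a datetime dtype, the .dt accessor exposes parts such as year and hour.
years = parsed.dt.year.tolist()
hours = parsed.dt.hour.tolist()
```

The same call should work on the full `df_time` series, assuming every entry follows this ISO 8601 layout.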

In [None]:
# At the end, make sure you export your dataframe (containing timestamps) using .to_csv
# df.to_csv("filename.csv")

In [7]:
df_time.to_csv("timestamp.csv")

In [116]:
terence = pd.read_csv("terence-data.csv")

In [117]:
# The data was downloaded with the row number and timestamp in the same column, so I separated the two using Excel.
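
The same separation could probably be done in pandas instead of Excel. The sketch below assumes the combined values look like `0,6/6/2022 10:39`, with a comma between the number and the timestamp; that layout and the column names `raw` and `row` are guesses for illustration, not the actual file's structure.

```python
import pandas as pd

# Hypothetical example: number and timestamp share one column,
# separated by a comma. The real file's layout may differ.
combined = pd.DataFrame({"raw": ["0,6/6/2022 10:39", "1,6/6/2022 10:43"]})

# Split on the first comma only and expand the pieces into two columns.
parts = combined["raw"].str.split(",", n=1, expand=True)
combined["row"] = parts[0].astype(int)
combined["endTime"] = parts[1]
```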

In [118]:
terence

Unnamed: 0,",",endTime
0,0,6/6/2022 10:39
1,1,6/6/2022 10:43
2,2,6/6/2022 10:46
3,3,6/6/2022 10:49
4,4,6/6/2022 10:52
...,...,...
9994,9994,9/8/2022 7:25
9995,9995,9/8/2022 7:30
9996,9996,9/8/2022 7:33
9997,9997,9/8/2022 7:36


In [119]:
# Drop the leftover "," column
# Reference: https://www.geeksforgeeks.org/delete-a-column-from-a-pandas-dataframe/
del terence[","]

In [120]:
terence

Unnamed: 0,endTime
0,6/6/2022 10:39
1,6/6/2022 10:43
2,6/6/2022 10:46
3,6/6/2022 10:49
4,6/6/2022 10:52
...,...
9994,9/8/2022 7:25
9995,9/8/2022 7:30
9996,9/8/2022 7:33
9997,9/8/2022 7:36


In [121]:
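# Transform it to a date format (the .dt accessor needs a datetime column, so this stays commented out while endTime is still text)
#terence["date"] = terence["endTime"].dt.date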
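
The `.dt` accessor only works on datetime columns, so the strings need converting first. A minimal sketch with two values copied from the table above; the frame name `terence_demo` and the column `endTime_dt` are made up for illustration, and the format string assumes month/day/year order, which matches the June-to-September range in the data.

```python
import pandas as pd

# Sample values copied from the table above (month/day/year, 24-hour clock).
terence_demo = pd.DataFrame({"endTime": ["6/6/2022 10:39", "9/8/2022 7:25"]})

# An explicit format avoids any month/day ambiguity.
terence_demo["endTime_dt"] = pd.to_datetime(
    terence_demo["endTime"], format="%m/%d/%Y %H:%M"
)

# .dt is available now that the column holds datetimes rather than strings.
terence_demo["date"] = terence_demo["endTime_dt"].dt.date
```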

In [148]:
# endTime is already in YYYY-MM-DD HH:MM:SS format, so pd.to_datetime can parse it without a format string
terence['int_endTime'] = pd.to_datetime(terence["endTime"])

In [149]:
terence

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,endTime,int_endTime
0,0,0,2022-06-06 10:39:00,2022-06-06 10:39:00
1,1,1,2022-06-06 10:43:00,2022-06-06 10:43:00
2,2,2,2022-06-06 10:46:00,2022-06-06 10:46:00
3,3,3,2022-06-06 10:49:00,2022-06-06 10:49:00
4,4,4,2022-06-06 10:52:00,2022-06-06 10:52:00
...,...,...,...,...
9994,9994,9994,2022-09-08 07:25:00,2022-09-08 07:25:00
9995,9995,9995,2022-09-08 07:30:00,2022-09-08 07:30:00
9996,9996,9996,2022-09-08 07:33:00,2022-09-08 07:33:00
9997,9997,9997,2022-09-08 07:36:00,2022-09-08 07:36:00


In [142]:
# I had issues displaying the dataframe as a table after transforming the values into a date format, so I wrote a new CSV file with the new date format.
terence.to_csv("mahlatini.csv")

In [143]:
terence = pd.read_csv("mahlatini.csv")

In [144]:
terence

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,endTime
0,0,0,2022-06-06 10:39:00
1,1,1,2022-06-06 10:43:00
2,2,2,2022-06-06 10:46:00
3,3,3,2022-06-06 10:49:00
4,4,4,2022-06-06 10:52:00
...,...,...,...
9994,9994,9994,2022-09-08 07:25:00
9995,9995,9995,2022-09-08 07:30:00
9996,9996,9996,2022-09-08 07:33:00
9997,9997,9997,2022-09-08 07:36:00
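
The extra `Unnamed: 0` columns above appear because each `to_csv` call writes the row index as a column, and each `read_csv` call reads it back as data. A CSV file also does not store dtypes, so datetimes come back as strings. A small sketch of a round trip that avoids both, using an in-memory buffer in place of a real file (the `demo` frame is illustrative):

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {"endTime": pd.to_datetime(["2022-06-06 10:39:00", "2022-06-06 10:43:00"])}
)

# index=False leaves the row index out of the file,
# so no "Unnamed: 0" column appears on reload.
buffer = io.StringIO()
demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype, which a plain CSV does not store.
reloaded = pd.read_csv(buffer, parse_dates=["endTime"])
```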


In [151]:
# The timestamps are already sorted chronologically, so I can directly add a column holding the difference between each timestamp and the previous one.
#terence["difference"] = terence.diff(periods=1, axis=0)
time = terence['int_endTime'].dt.time
time
# endTime holds strings here, so this diff raises the TypeError shown in the output
terence["difference"] = terence["endTime"].diff()
terence.head(100)

TypeError: unsupported operand type(s) for -: 'str' and 'str'

In [146]:
# Checking the variable type.
terence.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0.1  9999 non-null   int64 
 1   Unnamed: 0    9999 non-null   int64 
 2   endTime       9997 non-null   object
dtypes: int64(2), object(1)
memory usage: 234.5+ KB
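
The `object` dtype for `endTime` explains the `TypeError`: `diff()` tried to subtract one string from another. One possible fix, sketched here on three values copied from the table above, is to convert the column to datetimes before taking the difference. The frame name `demo` and the `difference_min` column are illustrative.

```python
import pandas as pd

# Sample values copied from the table above, stored as strings (dtype object).
demo = pd.DataFrame(
    {"endTime": ["2022-06-06 10:39:00", "2022-06-06 10:43:00", "2022-06-06 10:46:00"]}
)

# Convert first; errors="coerce" turns unparseable entries into NaT.
demo["endTime"] = pd.to_datetime(demo["endTime"], errors="coerce")

# diff() now yields Timedelta values; the first row has no predecessor, so it is NaT.
demo["difference"] = demo["endTime"].diff()

# Express the gaps in minutes for easier reading.
demo["difference_min"] = demo["difference"].dt.total_seconds() / 60
```

The `info()` output lists 9997 non-null values out of 9999 rows, so two entries are missing; those would become `NaT` and produce `NaT` differences rather than an error.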
