## Dummy dataframe

For this project I need to create a dummy CSV file that holds a dataframe of about 1000 rows. It will contain both valid data and data with errors, because my Parquet conversion program needs to clean the data before converting it to a Parquet file.

In [1]:
import time
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np



In [2]:
# setting the random seed to 42 so the randomly generated numbers are reproducible
np.random.seed(42) 

In [3]:
# creating a dataframe with 1000 rows
df = pd.DataFrame(index=range(1000))

# Adding a Unix timestamp column with one-second increments
# (time.mktime interprets the string in the machine's local time zone)
start_time = int(time.mktime(time.strptime("2023-01-31 00:00:00", "%Y-%m-%d %H:%M:%S")))

df["Timestamp"] = pd.to_numeric(range(start_time, start_time + 1000))

# kept for reference: a readable datetime column that future columns can be checked against
#df["Date_time"] = pd.to_datetime(df["Timestamp"], unit="s")

df.head()


Unnamed: 0,Timestamp
0,1675119600
1,1675119601
2,1675119602
3,1675119603
4,1675119604
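
`time.mktime` interprets the parsed time in the machine's local time zone, so the start value shown above (1675119600) depends on where the notebook is run. A minimal sketch, assuming a start time in UTC is wanted instead, uses `calendar.timegm` from the standard library. The name `start_time_utc` is illustrative and is not used elsewhere in this notebook.

```python
import calendar
import time

# Parse the same start string that is used in the cell above.
parsed = time.strptime("2023-01-31 00:00:00", "%Y-%m-%d %H:%M:%S")

# calendar.timegm treats the parsed time as UTC,
# so the result is the same on every machine.
start_time_utc = calendar.timegm(parsed)
print(start_time_utc)  # 1675123200
```

The one-hour difference from the value above suggests that the notebook was run in a UTC+1 time zone.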


In [11]:
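# one speed value per row
time_points = len(df)

mean_speed = 10

# random speed changes drawn from a normal distribution (mean 0, standard deviation 2)
speed_changes = np.random.normal(loc=0, scale=2, size=time_points)
# clipping the changes (not the final speed) to [0, 20]: negative changes become 0,
# so the speed never drops below mean_speed
speed_changes = np.clip(speed_changes, 0, 20).round(2)

df["speed_over_ground"] = mean_speed + speed_changes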

df.head()

Unnamed: 0,Timestamp,speed_over_ground
0,1675119600,11.57
1,1675119601,10.0
2,1675119602,11.43
3,1675119603,10.0
4,1675119604,11.41
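
Because the clip in the cell above is applied to the changes, every negative change becomes 0 and the speed never falls below `mean_speed`, which is why 10.0 appears so often in the output. If the intent is a speed that varies on both sides of the mean while staying between 0 and 20, one possible variation is to clip the final speed instead. This is an assumption about the intent, not the method used in this notebook, and the name `speed` is illustrative.

```python
import numpy as np

np.random.seed(42)

time_points = 1000
mean_speed = 10

# Random changes around the mean, both positive and negative.
speed_changes = np.random.normal(loc=0, scale=2, size=time_points)

# Clip the final speed (not the changes) to the range 0 to 20.
speed = np.clip(mean_speed + speed_changes, 0, 20).round(2)
```

With this variation the values spread on both sides of 10 rather than piling up at exactly 10.0.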


In [12]:
df["Latitude"] = np.random.uniform(low=-90, high=90, size=time_points)
df["Longitude"] = np.random.uniform(low=-180, high=180, size=time_points)

df.head()

Unnamed: 0,Timestamp,speed_over_ground,Latitude,Longitude
0,1675119600,11.57,-84.265783,96.124511
1,1675119601,10.0,-5.525461,159.110796
2,1675119602,11.43,76.432107,-102.172113
3,1675119603,10.0,-18.433464,97.734555
4,1675119604,11.41,22.486045,-172.721693


In [13]:
# fuel rate modelled as twice the speed, limited to the range 0 to 100
df["engine_fuel_rate"] = np.clip(df["speed_over_ground"] * 2, 0, 100)
df.head()

Unnamed: 0,Timestamp,speed_over_ground,Latitude,Longitude,engine_fuel_rate
0,1675119600,11.57,-84.265783,96.124511,23.14
1,1675119601,10.0,-5.525461,159.110796,20.0
2,1675119602,11.43,76.432107,-102.172113,22.86
3,1675119603,10.0,-18.433464,97.734555,20.0
4,1675119604,11.41,22.486045,-172.721693,22.82


In [22]:
df

Unnamed: 0,Timestamp,speed_over_ground,Latitude,Longitude,engine_fuel_rate
0,1675119600,11.57,-84.265783,96.124511,23.14
1,1675119601,10.00,-5.525461,159.110796,20.00
2,1675119602,11.43,76.432107,-102.172113,22.86
3,1675119603,10.00,-18.433464,97.734555,20.00
4,1675119604,11.41,22.486045,-172.721693,22.82
...,...,...,...,...,...
995,1675120595,14.42,-60.798677,-54.838173,28.84
996,1675120596,10.65,-34.212616,94.627972,21.30
997,1675120597,10.00,71.996990,38.869620,20.00
998,1675120598,13.31,33.871399,-143.069558,26.62


The next stage after creating this dataframe is to add some bad data. This can be done in a few ways: each column should get values that fall outside its valid range, and I will also add NaN values to each column so that my next program has to handle them. Below are some examples of what I will apply to each column.

- Timestamp
    - remove one or more digits so that there are fewer than 10 digits
    - add NaN value
    - add 0 value
    - add string value

- speed_over_ground
    - add string value
    - add negative value
    - add positive value over 20
    - add NaN value
    - add values with more or fewer than 2 decimal places

- Longitude and Latitude
    - add string value
    - add negative values below -180 (longitude) and -90 (latitude)
    - add positive values over 180 (longitude) and 90 (latitude)
    - add NaN value
    - add values with more or fewer than 2 decimal places

- engine_fuel_rate
    - add string value
    - add negative values
    - add values over the upper limit of 100
    - add NaN value
    - add values with more or fewer than 2 decimal places
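
The list above could be applied along the following lines. This is only a sketch under stated assumptions: the helper `inject_bad_values`, the `bad_values` dictionary, the specific bad values, and the choice of five rows per bad value are all hypothetical, and the small stand-in dataframe only mimics the column names used in this notebook. Each column is first converted to `object` dtype so that numbers, strings, and NaN can sit in the same column.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Stand-in dataframe with the same column names as the notebook (values are illustrative).
n = 1000
df = pd.DataFrame({
    "Timestamp": np.arange(1675119600, 1675119600 + n),
    "speed_over_ground": rng.uniform(0, 20, n).round(2),
    "Latitude": rng.uniform(-90, 90, n),
    "Longitude": rng.uniform(-180, 180, n),
    "engine_fuel_rate": rng.uniform(0, 100, n).round(2),
})

# Hypothetical bad values for each column, following the list above.
bad_values = {
    "Timestamp": [np.nan, 0, "not_a_time", 167511960],  # last entry has only 9 digits
    "speed_over_ground": [np.nan, "fast", -5.0, 25.0, 10.12345],
    "Latitude": [np.nan, "north", -95.0, 95.0, 12.3],
    "Longitude": [np.nan, "east", -185.0, 185.0, 12.3],
    "engine_fuel_rate": [np.nan, "full", -1.0, 150.0, 20.12345],
}


def inject_bad_values(frame, bad_values, per_value=5, rng=rng):
    """Return a copy of frame with each bad value written into per_value random rows."""
    dirty = frame.copy()
    for column, values in bad_values.items():
        # object dtype lets numbers, strings, and NaN share one column
        dirty[column] = dirty[column].astype(object)
        # pick distinct row positions so that no bad value overwrites another
        rows = rng.choice(len(dirty), size=per_value * len(values), replace=False)
        for i, value in enumerate(values):
            labels = dirty.index[rows[i * per_value:(i + 1) * per_value]]
            dirty.loc[labels, column] = value
    return dirty


dirty_df = inject_bad_values(df, bad_values)
print(dirty_df["speed_over_ground"].isna().sum())  # 5
```

Working on a copy keeps the clean dataframe available, which may be useful later for checking what the cleaning program removed or repaired.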