In [None]:
import pandas as pd

## Part 1: Dataset Questions

#### 1

From the first line of the dataset:

```
LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
```

Since this is a .csv file, you would normally expect a comma to be used as the separator. However, a | character is used instead. This is because the text columns already contain commas. If a comma were also used as the separator, it would be indistinguishable from the commas in the text, and rows would be parsed into an inconsistent number of columns.
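
As a small illustration of the conflict, the sketch below splits the line shown above on each candidate separator using plain string methods. The variable names are only illustrative and are not part of the assignment.

```python
# The first line of the dataset, copied from above.
text = (
    "Printing, in the only sense with which we are at present concerned, "
    "differs from most if not from all the arts and crafts represented "
    "in the Exhibition"
)
line = "LJ001-0001|" + text + "|" + text

# Splitting on | recovers the three intended columns.
pipe_fields = line.split("|")

# Splitting on , cuts through the text itself and yields extra pieces.
comma_fields = line.split(",")

print(len(pipe_fields), len(comma_fields))
```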

#### Note for the curious

CSV files can still use commas as separators if each field that contains a comma is wrapped in double quotes. Python and Pandas can both read and write this format. If it were used, the line above would look like this instead:

```
LJ001-0001,"Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition","Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition"
```

The Python documentation offers a little discussion around this here: https://docs.python.org/3/library/csv.html. In particular, look for the `quoting` argument to some CSV-related functions, which Pandas also accepts as an argument: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.
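
A minimal sketch of the quoted format, assuming a shortened version of the text and an in-memory buffer instead of a real file. It writes one row with Python's `csv` module and reads it back with Pandas using the default comma separator.

```python
import csv
import io

import pandas as pd

# Shortened, illustrative row; the real text is much longer.
row = ["LJ001-0001", "Printing, in the only sense", "Printing, in the only sense"]

# QUOTE_MINIMAL (the default) quotes only fields that contain
# the separator or other special characters.
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL)
writer.writerow(["id", "text", "text_norm"])
writer.writerow(row)
quoted_csv = buffer.getvalue()
print(quoted_csv)

# Pandas understands the quotes, so the commas inside the text
# are not treated as separators.
roundtrip_df = pd.read_csv(io.StringIO(quoted_csv))
print(roundtrip_df.shape)
```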

The material in this note will not be on the exams and is just for extra context.

#### 2

If you searched for LJ Speech, you might have found its website: https://keithito.com/LJ-Speech-Dataset/

LJ Speech contains short recordings of one person reading 7 different non-fiction books. The file given as part of the assignment is included in the dataset, and it contains transcriptions of each recording.

It is often used in speech research, particularly speech recognition and text-to-speech (i.e., speech synthesis).

#### 3

In `text_norm`, numeric values are spelled out.

In [None]:
# 4
df = pd.read_csv(
    "https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/ljspeech.csv",
    sep="|"
)

# This code will retrieve the first 5 rows with different text and text_norm values.
# Many submissions included additional code to present the results in an easier-
# to-read way.
df[df['text'] != df['text_norm']].head()

Unnamed: 0,id,text,text_norm
6,LJ001-0007,"the earliest book printed with movable types, ...","the earliest book printed with movable types, ..."
23,LJ001-0024,But the first Bible actually dated (which also...,But the first Bible actually dated (which also...
30,LJ001-0031,In 1465 Sweynheim and Pannartz began printing ...,In fourteen sixty-five Sweynheim and Pannartz ...
33,LJ001-0034,"They printed very few books in this type, thre...","They printed very few books in this type, thre..."
37,LJ001-0038,while in 1470 at Paris Udalric Gering and his ...,while in fourteen seventy at Paris Udalric Ger...


## Part 2: Cameras

In [None]:
camera_df = pd.read_csv("https://raw.githubusercontent.com/CUNY-CISC-3225/datasets/main/cameras.csv")
camera_df.head()

Unnamed: 0,Model,Release date,Max resolution,Low resolution,Effective pixels,Zoom wide (W),Zoom tele (T),Normal focus range,Macro focus range,Storage included,Weight (inc. batteries),Dimensions,Price,Brand
0,Canon PowerShot 350,1997,640.0,0.0,0.0,42.0,42.0,70.0,3.0,2.0,320.0,93.0,149.0,Canon
1,Canon PowerShot 600,1996,832.0,640.0,0.0,50.0,50.0,40.0,10.0,1.0,460.0,160.0,139.0,Canon
2,Canon PowerShot A10,2001,1280.0,1024.0,1.0,35.0,105.0,76.0,16.0,8.0,375.0,110.0,139.0,Canon
3,Canon PowerShot A100,2002,1280.0,1024.0,1.0,39.0,39.0,20.0,5.0,8.0,225.0,110.0,139.0,Canon
4,Canon PowerShot A20,2001,1600.0,1024.0,1.0,35.0,105.0,76.0,16.0,8.0,375.0,110.0,139.0,Canon


In [None]:
# 1: What is the average cost of all the cameras?
camera_df['Price'].mean()

454.94093264248704

In [None]:
# 2: What is the brand and model name of the most expensive camera? There may be more than one.
camera_df[camera_df['Price'] == camera_df['Price'].max()]['Model']

45             Canon EOS-1Ds
46     Canon EOS-1Ds Mark II
47    Canon EOS-1Ds Mark III
Name: Model, dtype: object

In [None]:
# 3: Which are more common, below-average-cost cameras or above-average-cost cameras?
# There are more below-average-cost cameras.
print("Below-average-cost cameras:", len(camera_df[camera_df['Price'] < camera_df['Price'].mean()]))
print("Above-average-cost cameras:", len(camera_df[camera_df['Price'] > camera_df['Price'].mean()]))

Below-average-cost cameras: 812
Above-average-cost cameras: 153


In [None]:
# 4: Digital camera technology improved rapidly during the 1997-2007 period.
#    Divide the DataFrame into two new DataFrames: one with cameras released on
#    or before 2002, and one with cameras released after 2002.
#
# From below, we see:
#
# 4a. Resolution, zoom tele, storage, weight, dimensions, and price improved.
# 4b. Zoom wide did not change by much.
# 4c. Normal focus range and macro focus range got worse.
#
# Note that answers above vary and depend on how strict or interpretive
# you want to be with the differences shown below. However, note that
# price and weight decreased over time, but that is a good thing and
# constitutes an improvement.
camera_before_2002_df = camera_df[camera_df['Release date'] <= 2002]
camera_after_2002_df = camera_df[camera_df['Release date'] > 2002]

print("Before 2002:")
print(camera_before_2002_df.mean())
print()

print("After 2002:")
print(camera_after_2002_df.mean())

Before 2002:
Release date               2000.294118
Max resolution             1764.761246
Low resolution             1050.408304
Effective pixels              1.896194
Zoom wide (W)                33.695502
Zoom tele (T)               103.653979
Normal focus range           46.633218
Macro focus range            10.055363
Storage included             11.141869
Weight (inc. batteries)     430.249135
Dimensions                  116.330450
Price                       525.636678
dtype: float64

After 2002:
Release date               2005.244083
Max resolution             2837.696746
Low resolution             2156.576923
Effective pixels              5.970414
Zoom wide (W)                32.495562
Zoom tele (T)               132.650888
Normal focus range           42.531065
Macro focus range             6.575444
Storage included             20.986686
Weight (inc. batteries)     276.736686
Dimensions                  101.517751
Price                       424.717456
dtype: float64



#### 5

There are two ways to approach this answer.

If you computed the means as shown in the call above, non-numeric columns were not included in the output because we cannot compute the mean of them.

However, some students noticed a warning in the cell output. It is an alert from Pandas that this behavior is deprecated and will be removed in a future version. Some students running more recent versions of Pandas locally encountered an exception instead.

If this applies to you, your version of Pandas no longer silently ignores non-numeric columns in this situation and expects you to specify which columns to average. If the selection includes a non-numeric column, Pandas will raise an exception.
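
One way to be explicit about this, sketched on a small made-up DataFrame rather than the real camera data: either pass `numeric_only=True` to `.mean()`, or select the numeric columns yourself before averaging. The models and values below are hypothetical.

```python
import pandas as pd

# Hypothetical sample with one non-numeric column.
sample_df = pd.DataFrame({
    "Model": ["Camera A", "Camera B", "Camera C"],
    "Release date": [2001, 2003, 2005],
    "Price": [100.0, 200.0, 300.0],
})

# Option 1: ask Pandas to skip non-numeric columns.
numeric_means = sample_df.mean(numeric_only=True)

# Option 2: choose the columns to average explicitly.
selected_means = sample_df[["Release date", "Price"]].mean()

print(numeric_means)
print(selected_means)
```

Both options should behave the same way across Pandas versions, since neither relies on non-numeric columns being dropped implicitly.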

In [None]:
# 6
camera_df['price_to_pixel'] = camera_df['Price'] / camera_df['Effective pixels']
camera_df.loc[
    camera_df['price_to_pixel'] == camera_df['price_to_pixel'].min(),
    ['Model', 'Price', 'Effective pixels', 'price_to_pixel']
]

Unnamed: 0,Model,Price,Effective pixels,price_to_pixel
362,Kodak DCS 14n,129.0,13.0,9.923077
374,Kodak DCS SLR/c,129.0,13.0,9.923077
375,Kodak DCS SLR/n,129.0,13.0,9.923077
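
One detail worth knowing about the division above: when "Effective pixels" is 0, Pandas produces `inf` instead of raising an error, and `inf` is never the minimum, so the result above is unaffected. The sketch below uses hypothetical toy rows to show this, along with one possible way (an assumption, not the assignment's required approach) to exclude such rows by turning 0 into `NaN` first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the first has 0 effective pixels.
toy_df = pd.DataFrame({
    "Model": ["Camera A", "Camera B", "Camera C"],
    "Price": [149.0, 139.0, 129.0],
    "Effective pixels": [0.0, 1.0, 13.0],
})

# Floating-point division by zero yields inf rather than an exception.
ratio = toy_df["Price"] / toy_df["Effective pixels"]

# Replacing 0 with NaN marks those ratios as missing instead of infinite.
safe_ratio = toy_df["Price"] / toy_df["Effective pixels"].replace(0, np.nan)

print(ratio)
print(safe_ratio)
```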


In [None]:
# 7
# There are several ways to find this out. Here, we call .info() on
# the DataFrame. From this, we see all columns have 965 non-null values,
# which comprises the entire dataset.
camera_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 965 entries, 0 to 964
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Model                    965 non-null    object 
 1   Release date             965 non-null    int64  
 2   Max resolution           965 non-null    float64
 3   Low resolution           965 non-null    float64
 4   Effective pixels         965 non-null    float64
 5   Zoom wide (W)            965 non-null    float64
 6   Zoom tele (T)            965 non-null    float64
 7   Normal focus range       965 non-null    float64
 8   Macro focus range        965 non-null    float64
 9   Storage included         965 non-null    float64
 10  Weight (inc. batteries)  965 non-null    float64
 11  Dimensions               965 non-null    float64
 12  Price                    965 non-null    float64
 13  Brand                    965 non-null    object 
 14  price_to_pixel           9

In [None]:
# 8
# There are several ways to approach investigating this. If you output the
# DataFrame and look at the first few rows, you'll see something unusual: the
# row at index 0 has a value of 0.0 for both Low Resolution and Effective Pixels.
# Technically, the "Effective pixels" column likely uses megapixels as units, but
# even then it's a little strange to see 0 megapixels and not a decimal number.
camera_df

Unnamed: 0,Model,Release date,Max resolution,Low resolution,Effective pixels,Zoom wide (W),Zoom tele (T),Normal focus range,Macro focus range,Storage included,Weight (inc. batteries),Dimensions,Price,Brand,price_to_pixel
0,Canon PowerShot 350,1997,640.0,0.0,0.0,42.0,42.0,70.0,3.0,2.0,320.0,93.0,149.0,Canon,inf
1,Canon PowerShot 600,1996,832.0,640.0,0.0,50.0,50.0,40.0,10.0,1.0,460.0,160.0,139.0,Canon,inf
2,Canon PowerShot A10,2001,1280.0,1024.0,1.0,35.0,105.0,76.0,16.0,8.0,375.0,110.0,139.0,Canon,139.0
3,Canon PowerShot A100,2002,1280.0,1024.0,1.0,39.0,39.0,20.0,5.0,8.0,225.0,110.0,139.0,Canon,139.0
4,Canon PowerShot A20,2001,1600.0,1024.0,1.0,35.0,105.0,76.0,16.0,8.0,375.0,110.0,139.0,Canon,139.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
960,Sony Mavica FD-88,1999,1280.0,1024.0,1.0,34.0,270.0,4.0,4.0,1.0,600.0,142.0,1399.0,Sony,1399.0
961,Sony Mavica FD-90,2000,1280.0,1280.0,1.0,34.0,270.0,4.0,4.0,1.0,0.0,0.0,1399.0,Sony,1399.0
962,Sony Mavica FD-92,2001,1472.0,1280.0,1.0,41.0,328.0,25.0,3.0,1.0,660.0,126.0,1399.0,Sony,1399.0
963,Sony Mavica FD-95,2000,1600.0,1600.0,1.0,37.0,370.0,80.0,3.0,1.0,900.0,0.0,1399.0,Sony,1399.0


In [None]:
# If we investigate this further and check the minimum value in each column, we
# see something weird: the minimum value for most columns is 0.0.
# Sometimes this makes sense - for example, for Storage Included, maybe the camera
# does not have any onboard memory and requires you to purchase a memory card.
# In other situations, it is more suspicious - how can a camera have weight and
# dimensions of 0? How can the resolution be 0? How can it not zoom or have any
# focus range?
#
# There is unfortunately little information about the creator's intentions,
# so it is difficult to tell if these are truly missing values or if the creator
# simply rounded so that small values became 0. But in either case, 0s don't
# make sense for some of these columns. If a rounding down happened, maybe it won't
# affect our calculations much. But if the 0s indicate missing values, it
# would definitely affect calculations involving them.
camera_df.min()

Model                      Canon EOS 10D
Release date                        1994
Max resolution                       0.0
Low resolution                       0.0
Effective pixels                     0.0
Zoom wide (W)                        0.0
Zoom tele (T)                        0.0
Normal focus range                   0.0
Macro focus range                    0.0
Storage included                     0.0
Weight (inc. batteries)              0.0
Dimensions                           0.0
Price                               99.0
Brand                              Canon
price_to_pixel                  9.923077
dtype: object
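
To see how much the 0s could matter, here is a sketch under the assumption that a 0 means "missing". It uses a small hypothetical DataFrame, counts the 0s in the suspicious columns, and compares the mean weight before and after replacing 0 with `NaN`, which `.mean()` skips.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second has 0 for weight and dimensions.
sample_df = pd.DataFrame({
    "Model": ["Camera A", "Camera B", "Camera C"],
    "Weight (inc. batteries)": [320.0, 0.0, 460.0],
    "Dimensions": [93.0, 0.0, 160.0],
})
suspect_columns = ["Weight (inc. batteries)", "Dimensions"]

# Count how many 0s appear in each suspicious column.
zero_counts = (sample_df[suspect_columns] == 0).sum()

# Treat 0 as missing (an assumption) by replacing it with NaN.
cleaned_df = sample_df.copy()
cleaned_df[suspect_columns] = cleaned_df[suspect_columns].replace(0, np.nan)

mean_with_zeros = sample_df["Weight (inc. batteries)"].mean()
mean_without_zeros = cleaned_df["Weight (inc. batteries)"].mean()
print(zero_counts)
print(mean_with_zeros, mean_without_zeros)
```

In this toy case the mean weight differs noticeably between the two versions, which is the kind of effect the comments above warn about.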

## Part 3: Pandas Questions

1. A Series is equivalent to a one-dimensional array. It contains values all of the same type. A DataFrame is equivalent to a two-dimensional array. It is a table of values with rows and columns, where all values in a column are of the same type.

2. Indices allow you to access values in a Series because each element is associated with an index value. In a DataFrame, an entire row is associated with an index, so you can retrieve rows by their index value. Additionally, columns are an index, and you can access the values in entire columns by their name.

3. On Series objects, loc allows you to retrieve single elements or ranges of elements by their index value. iloc allows you to retrieve single elements or ranges of elements by their position in the Series.

4. On DataFrame objects, loc and iloc work the same way as they do for Series objects. However, they can retrieve both rows and columns by index (loc) or position (iloc).

5. A Series is like a dictionary because individual elements can be accessed by their index value, which is similar to how a dictionary creates an association between keys and values. A Series is also similar to a one-dimensional array because its elements are stored in a specific order.

6. A DataFrame is like a dictionary because it creates associations between keys and values, in this case keys being column and/or row indices accessing columns and/or rows. It is also like a two-dimensional array because the row and column format requires managing data along two dimensions. Like 2D arrays, the rows and columns occupy a specific position inside the DataFrame.
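
The answers above can be illustrated with a short sketch. The labels, models, and prices below are made up for illustration only.

```python
import pandas as pd

# A Series with string index labels.
prices = pd.Series([149.0, 139.0, 1399.0], index=["x", "y", "z"])

label_value = prices.loc["y"]      # loc: look up by index label
position_value = prices.iloc[0]    # iloc: look up by position
label_slice = prices.loc["x":"y"]  # label slices include the end label
position_slice = prices.iloc[0:1]  # position slices exclude the end position
has_key = "y" in prices            # membership checks the index, like dict keys

# A DataFrame with the same row labels and two named columns.
cameras = pd.DataFrame(
    {
        "Model": ["Camera X", "Camera Y", "Camera Z"],
        "Price": [149.0, 139.0, 1399.0],
    },
    index=["x", "y", "z"],
)

cell_by_label = cameras.loc["y", "Price"]  # row label, column name
cell_by_position = cameras.iloc[1, 1]      # row position, column position
price_column = cameras["Price"]            # dictionary-style column access

print(label_value, position_value, cell_by_label, cell_by_position)
```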