Generally, follow the PEP8 guidelines.
Specific points to keep in mind (most of this is part of PEP8 already):
- Use the Calico document tools notebook plugin (installation instructions) to maintain consistent section numbering and an up-to-date table of contents at the beginning of each notebook
- Limit line length to 80 characters when reasonably possible
- Generally, add a docstring to functions that describes what the function does, especially if the function is more than a few lines (i.e. not easily self-describing). See this example of a short one-line docstring sufficient for simple functions.
- Use the logging module to give status updates from your code, rather than print
- Don't comment on obvious things, i.e. don't do this:
dataframe.plot() # Plot the dataframe
- Function and variable names should be all lowercase
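As a rough illustration of several of the points above (a one-line docstring, lowercase names, logging instead of print), here is a minimal sketch. The function to_megawatts and the variable capacity_mw are hypothetical names chosen only for this example.

```python
import logging

logger = logging.getLogger('notebook')


def to_megawatts(kilowatts):
    """Convert a power value from kilowatts to megawatts."""
    return kilowatts / 1000


# Hypothetical input value, used only for illustration
capacity_mw = to_megawatts(2500)

# Status update via logging rather than print
logger.info('Converted capacity: %s MW', capacity_mw)
```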
Examples and recipes
Dealing with German number formats in pandas
When reading data, e.g. from a CSV file, pandas tries to automatically convert columns into the correct datatypes, i.e. parse numerical values.
Columns with object dtype in a DataFrame usually contain strings, so it is worth checking whether all numerical data have been parsed as either float or int dtypes. Often, the quickest way to do this is to set the thousands and decimal arguments of read_csv:
df = pd.read_csv(path_to_my_file, thousands='.', decimal=',')
An alternative approach is to manually process a column after reading the file, which is usually more appropriate in complex cases where multiple values need to be replaced or some other logic has to happen:
df['lat'] = df['lat'].str.replace('.', '', regex=False).astype('float64')
Pandas provides a large number of vectorized (=fast) string methods via the .str accessor; see the documentation for a complete list.
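The two approaches described above can be compared on a small in-memory sample. This is only a sketch: the semicolon-separated data and the column names are made up for illustration, and the manual variant also swaps the decimal comma for a point, which your own data may or may not need.

```python
import io

import pandas as pd

# Hypothetical sample in German number format (assumption for this example)
raw = "name;capacity\nA;1.234,5\nB;12,75\n"

# Approach 1: let read_csv handle thousands and decimal separators
df = pd.read_csv(io.StringIO(raw), sep=';', thousands='.', decimal=',')

# Approach 2: read the column as strings, then clean it up manually
df_manual = pd.read_csv(io.StringIO(raw), sep=';')
df_manual['capacity'] = (
    df_manual['capacity']
    .str.replace('.', '', regex=False)   # drop thousands separators
    .str.replace(',', '.', regex=False)  # decimal comma -> decimal point
    .astype('float64')
)

print(df.dtypes)
```

Passing regex=False makes sure the dot is treated as a literal character rather than as a regular-expression wildcard.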
WARNING: The logging module does not work with ipykernel 4.3.0 and 4.3.1. You can downgrade to version 4.2.2:
conda install ipykernel=4.2.2
You can set the logging level at the beginning of a notebook to only print messages that are at that level or higher:
import logging
logger = logging.getLogger('notebook')
logger.setLevel('INFO')
Then log, for example, with:
logger.error('An error occurred')
To add a timestamp to each log entry, add the following after the initialization code above:
nb_root_logger = logging.getLogger()
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s',
                              datefmt='%d %b %Y %H:%M:%S')
# handlers is a list, so set the formatter on each attached handler
for handler in nb_root_logger.handlers:
    handler.setFormatter(formatter)
By giving the logger a name ("notebook") and setting the log level only on that specific logger, logging output from other imported libraries is not displayed. To display all info-level logging output, not just that of the custom 'notebook' logger, set the level on the root logger instead.
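One way to show info-level output from every logger is to lower the level of the root logger, since loggers without their own level inherit it. The following is a sketch under that assumption, not necessarily the configuration the original page used; 'some_library' is a hypothetical logger name standing in for an imported library.

```python
import logging

# Make sure the root logger has a handler; this is a no-op if one exists
logging.basicConfig()

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

# Loggers without an explicit level now inherit INFO from the root logger
library_logger = logging.getLogger('some_library')  # hypothetical name
library_logger.info('This message is now displayed')
```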
Complex configuration, e.g. long list of different URLs to download
To make this more readable, we can use triple quotes (""") to write a multi-line YAML string and then parse it into a Python dict.
Note: if dates are given in the form YYYY-MM-DD, the YAML parser automatically converts them into Python date objects, which simplifies later code. If they should be treated as strings instead, they have to be wrapped in quotes.
import yaml
conf = """
Fingrid:
    wind:
        url_template: http://www.fingrid.fi/en/electricity-market/load-and-generation/windpower/Pages/default.aspx/Excel/TimeSeries.xls?beginDate=20150101&endDate=20151231&variables=H4sIAAAAAAAEAO29B2AcSZYlJi9tynt_SvVK1-B0oQiAYBMk2JBAEOzBiM3mkuwdaUcjKasqgcplVmVdZhZAzO2dvPfee--999577733ujudTif33_8_XGZkAWz2zkrayZ4hgKrIHz9-fB8_Ih7_ZFYX2aTMmyP769nyvDp6_CJb5EfPzn7_N1999fzsJ788--L49_99vv3m8V3-_PHZ8jKv25_MynV-dJ6VTf74rv_RY_7xrKoXWfu6rYvlxdGPj378x3ce3-1_8firZdG-yd-1R198d353_viu_fvxm6It86PvFstZuqqu8jq9yJd5nbVFtUy3uPWdx3el0ePnxTI_qcqqTu96f3y7XZRHP_7k5MnDHeo8_Pjx63l19eWyvH62btd1_jRrMzOWyDdA3aeP_bM5-n8AhQmq0kUBAAA1&cultureId=en-US&dataTimePrecision=5
        bundle: YEARLY
        start: 2014-11-28
        end: 2015-01-01
    ...
...
"""
# safe_load avoids executing arbitrary tags; newer PyYAML versions
# require an explicit Loader when calling yaml.load
conf = yaml.safe_load(conf)
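The date handling mentioned in the note can be checked with a small sketch. The keys source, start and end are hypothetical and chosen only for this example; it assumes PyYAML is installed.

```python
import datetime

import yaml

# Hypothetical configuration: one unquoted and one quoted date
conf_text = """
source:
    start: 2014-11-28
    end: '2015-01-01'
"""

parsed = yaml.safe_load(conf_text)

start = parsed['source']['start']  # parsed into a date object
end = parsed['source']['end']      # stays a plain string

print(type(start), type(end))
```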
Getting filenames from the Content-Disposition header
Useful when the filename isn't part of the URL but we want to preserve it anyway:
import requests
url = 'https://www.transnetbw.de/de/kennzahlen/erneuerbare-energien/windenergie?app=wind&activeTab=csv&selectMonatDownload=10&view=1&download=true'
r = requests.get(url)
# Here, the file name is part of the 'content-disposition' header,
# from which we simply extract it
filename = r.headers['content-disposition'].split('filename=')[-1]
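Splitting on 'filename=' keeps any surrounding quotes in the result. If that matters, one possible alternative is to let the standard library parse the header value. The sketch below works on a made-up header string rather than a live response, and filename_from_content_disposition is a hypothetical helper name.

```python
from email.message import Message


def filename_from_content_disposition(header_value):
    """Return the filename parameter of a Content-Disposition value, or None."""
    message = Message()
    message['content-disposition'] = header_value
    return message.get_filename()


# Hypothetical header value, as a server might send it
header = 'attachment; filename="wind_2015_10.csv"'

simple = header.split('filename=')[-1]              # quotes are kept
parsed = filename_from_content_disposition(header)  # quotes are removed
```

With a real response, the header value would be passed in as r.headers['content-disposition'].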